<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Hewi's Blog]]></title><description><![CDATA[Hewi's Blog]]></description><link>https://hewi.blog</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 09:22:38 GMT</lastBuildDate><atom:link href="https://hewi.blog/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Rate Limiting Algorithms in depth]]></title><description><![CDATA[Introduction
In today’s world, rate limiting has become essential for system stability and fairness. Whether you’re running an AI SaaS platform and want to keep free trial users in check, or protecting your API server from resource starvation, a good...]]></description><link>https://hewi.blog/rate-limiting-algorithms-in-depth</link><guid isPermaLink="true">https://hewi.blog/rate-limiting-algorithms-in-depth</guid><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[rate-limiting]]></category><category><![CDATA[General Programming]]></category><category><![CDATA[architecture]]></category><category><![CDATA[System Design]]></category><category><![CDATA[System Architecture]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sun, 31 Aug 2025 21:27:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756675551006/242e6e62-01f0-4141-875d-b546fb8ad464.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>In today’s world, rate limiting has become essential for system stability and fairness. Whether you’re running an AI SaaS platform and want to keep free trial users in check, or protecting your API server from resource starvation, a good rate limiting strategy ensures your service stays reliable under pressure. In this article, I’ll walk you through the most widely used algorithms that power rate limiting in modern systems.</p>
<p>We’ll explore each algorithm in depth, covering its pros and cons. Let’s dive in.</p>
<h1 id="heading-token-bucket-algorithm">Token Bucket Algorithm</h1>
<p>Imagine this: we have a bucket, and every <strong>T seconds</strong> a token is dropped into it. Each incoming request needs to grab a token from the bucket in order to pass through. If the bucket is empty, requests have to wait until new tokens arrive. Because the bucket can hold multiple tokens, clients are allowed to make short bursts of requests — but over time, the rate is capped by how quickly tokens refill.</p>
<p>That’s the token bucket algorithm in a nutshell.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756667165971/52c7ce5a-6a0c-4fce-b6db-a206e6150dda.gif" alt class="image--center mx-auto" /></p>
<p>When the bucket is empty, incoming requests can’t proceed. Most implementations reject them immediately (think <code>429 Too Many Requests</code>). Others choose to queue requests until tokens are available, but that’s not part of the core Token Bucket algorithm — it’s an extra design decision depending on whether you value strict protection or smoother user experience.</p>
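<p>Here’s a minimal single-process sketch of the idea in Python (names and numbers are illustrative; a production limiter would usually keep this state per client in a shared store like Redis):</p>
<pre><code class="lang-python">import time

class TokenBucket:
    """Minimal token bucket: refill_rate tokens/sec, bursts up to capacity."""
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)        # start full
        self.last_refill = time.monotonic()

    def allow(self) -&gt; bool:
        now = time.monotonic()
        # Lazily add the tokens accrued since the last call, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens &gt;= 1:
            self.tokens -= 1
            return True
        return False  # out of tokens: caller maps this to 429 Too Many Requests

limiter = TokenBucket(capacity=50, refill_rate=10)  # 10 tokens/sec, bursts of 50
</code></pre>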
<h2 id="heading-pros">Pros:</h2>
<ol>
<li><p>Simple concept to grasp and implement</p>
</li>
<li><p>Flexible as you can control the refill speed and bucket capacity</p>
</li>
<li><p>Throughput of an API is capped by the token refill rate</p>
</li>
</ol>
<h2 id="heading-cons">Cons:</h2>
<ol>
<li><p><strong>Burst allowance can be risky</strong>: If many tokens are available (e.g., 50), the algorithm allows all 50 requests in an instant — which can look like a traffic spike (“thundering herd”) and overload downstream systems.</p>
</li>
<li><p><strong>Stateful per client</strong>: Requires tracking a bucket for every user/client/IP, which can add memory overhead at scale.</p>
</li>
<li><p><strong>Queuing isn’t built-in</strong>: If you want to delay rather than drop requests, you need extra queuing infrastructure.</p>
</li>
</ol>
<h1 id="heading-leaky-bucket-algorithm">Leaky Bucket Algorithm</h1>
<p>This algorithm approaches rate limiting from a different perspective. Imagine the same bucket, but this time it has a tiny hole at the bottom, leaking water at a constant rate. Each incoming request is like a drop of water added to the bucket. As long as there’s space, the request goes in and eventually drips out at the steady leak rate. But once the bucket is full, any additional drops (requests) simply spill over and get dropped.</p>
<p>Code-wise, the steady flow is enforced by a <strong>timer</strong> (or tick) that “releases” requests at a fixed interval. It effectively acts as a bounded buffer: you fully control the leak rate and the bucket size.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756668291028/5f10c6b7-91dd-4527-9f04-881ee2bd5e3d.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-pros-1">Pros:</h2>
<ol>
<li><p>Requests flow steadily into our system, which solves the bursting issue of the token bucket algorithm.</p>
</li>
<li><p>It applies strong back pressure by rejecting any requests once the bucket is full.</p>
</li>
<li><p>Simple to understand and implement</p>
</li>
</ol>
<h2 id="heading-cons-1">Cons:</h2>
<ol>
<li><p>Increased response time, since queued requests are processed at the fixed leak rate rather than immediately</p>
</li>
<li><p>Anti-burst: this was a pro above, but it can be a con for use cases that legitimately need short bursts.</p>
</li>
</ol>
<h1 id="heading-fixed-window-counter">Fixed Window Counter</h1>
<p>Now we move on to a completely different analogy: this one is all about <strong>time</strong>. Imagine a window of fixed length — say one minute, from 1:00 to 1:59. Within that window, only a certain number of requests are allowed through, let’s say 100. Once the cap is reached, any additional requests during that same window are rejected outright. When the next window starts (2:00 to 2:59), the counter resets, and the process repeats.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1670072460996/8VYwSPQac.png?auto=compress,format&amp;format=webp" alt /></p>
<p>As we can see in the image above, once the requests exceed the dotted line they get discarded.</p>
<p>However, this algorithm has one major disadvantage, visible in the image below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1670072546451/aSRxcbfGe.png?auto=compress,format&amp;format=webp" alt /></p>
<p>The main flaw of the fixed window approach shows up when requests cluster around the <strong>boundary</strong>. For example:</p>
<ul>
<li><p>Between <strong>1:30 and 2:00</strong>, a client sends 100 requests (the max).</p>
</li>
<li><p>Then between <strong>2:00 and 2:30</strong>, they send another 100.</p>
</li>
</ul>
<p>Both windows independently look fine — each one respects the “100 requests per minute” limit.</p>
<p>But if you look at the overlapping interval <strong>1:30–2:30</strong> (which is still one minute long), the client actually made <strong>200 requests</strong>. This breaks the intended rate limit and can overload your system.</p>
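<p>Despite the flaw, this one is genuinely tiny to implement, which is its main appeal. A minimal Python sketch (per-client bookkeeping omitted for brevity):</p>
<pre><code class="lang-python">import time

class FixedWindowCounter:
    """Allow at most `limit` requests per clock-aligned `window`-second window."""
    def __init__(self, limit: int, window: int):
        self.limit = limit
        self.window = window
        self.current_window = -1
        self.count = 0

    def allow(self) -&gt; bool:
        window = int(time.time() // self.window)  # e.g. minute number since epoch
        if window != self.current_window:
            self.current_window = window          # a new window starts: reset
            self.count = 0
        if self.count &lt; self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowCounter(limit=100, window=60)  # "100 requests per minute"
</code></pre>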
<h2 id="heading-pros-2">Pros:</h2>
<ol>
<li><p>Easiest of all rate-limit algorithms to implement (just a counter and a reset timer).</p>
</li>
<li><p>No need to store request timestamps or scan logs.</p>
</li>
<li><p>Easy to communicate (“100 requests per minute”) and easy for clients to reason about.</p>
</li>
</ol>
<h2 id="heading-cons-2">Cons:</h2>
<ol>
<li><p>Requests clustered around window edges can exceed the limit in any rolling interval (e.g., 200 requests in 60 seconds instead of 100).</p>
</li>
<li><p>Bursty clients can exploit reset points, while evenly spaced clients are constrained more strictly.</p>
</li>
</ol>
<h1 id="heading-sliding-window-log">Sliding Window Log</h1>
<p>This algorithm fixes the problem with the previous one. Instead of counting requests inside <strong>aligned</strong> windows (e.g., 2:00–2:59), we keep a <strong>timestamped log</strong> of <em>each</em> accepted request.</p>
<p>Let’s say we set a rule: <strong>maximum 5 requests per 10 seconds</strong>. Instead of fixed one-minute blocks, this window is always <em>moving forward in time</em>.</p>
<p>Each time a new request arrives, we do two things:</p>
<ol>
<li><p><strong>Prune old requests</strong> → remove any logged timestamps that are older than <code>now − 10 seconds</code>.</p>
</li>
<li><p><strong>Count the remaining requests</strong> → this gives us how many requests were made in the <em>current rolling window</em>.</p>
</li>
</ol>
<p>If the count is still below 5, the new request is allowed and added to the log. If the count is already at 5, the new request is rejected.</p>
<p>In this way, the algorithm enforces the rule across <em>any 10-second span</em>, not just neat boundaries like <code>0–10</code> or <code>10–20</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756671721995/8023279b-59d2-48b2-ba66-a90020366345.gif" alt class="image--center mx-auto" /></p>
<h2 id="heading-pros-3">Pros:</h2>
<ol>
<li><p>Enforces “no more than N requests in the last W seconds” with no fixed-window edge cases.</p>
</li>
<li><p>Boundary exploits (end-of-window spikes) don’t slip through.</p>
</li>
</ol>
<h2 id="heading-cons-3">Cons:</h2>
<ol>
<li><p>Memory heavy: it stores one timestamp per accepted request (hot clients = big logs).</p>
</li>
<li><p>CPU overhead: it must prune old timestamps on every request.</p>
</li>
<li><p>Heavy per-client load can bottleneck a single store key.</p>
</li>
</ol>
<h1 id="heading-sliding-window-counter">Sliding Window Counter</h1>
<p>Think of it as a <strong>compromise</strong> between Fixed Window (too sloppy) and Sliding Log (too heavy).</p>
<p>Instead of tracking every request across the whole window (which can be heavy: a 60-second window means storing a lot of log entries), we split the window into <strong>buckets</strong>: a 60-second window becomes 60 buckets of 1 second each, and each bucket just holds a counter.</p>
<p>For example, say we allow a maximum of <strong>10 requests per 60 seconds</strong>:</p>
<ul>
<li><p>At <strong>12:00:00–12:00:01</strong>, we got <strong>7 requests</strong> (in bucket A).</p>
</li>
<li><p>At <strong>12:00:01–12:00:02</strong>, we already have <strong>6 requests</strong> so far (in bucket B).</p>
</li>
<li><p>A new request comes in at <strong>12:00:01.5</strong> (halfway into bucket B).</p>
</li>
</ul>
<p>And we want to know: “How many requests happened in the last 60s?”</p>
<p>In the Sliding Window Counter, we look at all the buckets that overlap with the rolling window.</p>
<ul>
<li><p>Buckets that are <strong>fully inside</strong> the window are counted with a weight of <strong>1.0</strong> (their full value).</p>
</li>
<li><p>Buckets that are only <strong>partially inside</strong> the window — usually just the oldest and the newest — are counted with a fractional weight equal to how much of them overlaps with the window.</p>
</li>
</ul>
<p>So if the rolling window overlapped 4 buckets, 2 of them fully inside the window, the calculation would be as follows:</p>
<pre><code class="lang-plaintext">1.0 * (fully inside bucket A request count) 
+ 1.0 * (fully inside bucket B request count) 
+ (oldest bucket overlap fraction) * (oldest bucket request count)
+ (newest bucket overlap fraction) * (newest bucket request count)
</code></pre>
<p>If this total is below the threshold (10 requests per 60 seconds), we allow the request; otherwise we reject it.</p>
<blockquote>
<p>At first glance, the Sliding Window Counter sounds neat — split time into buckets, weight the edges, sum them all up. But think about it: if your window is 60 seconds with 1-second buckets, that’s <strong>60 buckets to check</strong> on every request. Bump that to a 5-minute window with millisecond precision and you’re suddenly tracking <strong>300,000 buckets</strong>. Looking at <em>every single bucket</em> quickly becomes wasteful. The math is overkill, and the per-request overhead will crush performance at scale.</p>
</blockquote>
<p>In practice, we don’t need to scan all the buckets. Notice that:</p>
<ul>
<li><p><strong>Middle buckets</strong> are either <strong>fully inside</strong> or <strong>fully outside</strong> the window → no weighting needed.</p>
</li>
<li><p>Only the <strong>oldest</strong> and <strong>newest</strong> buckets overlap partially and require fractional weights.</p>
</li>
</ul>
<p>So instead of summing hundreds (or thousands) of buckets, we can keep:</p>
<ol>
<li><p>A <strong>running total</strong> of all requests.</p>
</li>
<li><p>An adjustment for only the <strong>two edge buckets</strong> that partially overlap.</p>
</li>
</ol>
<p>This way, each request check is <strong>O(1)</strong> — constant time regardless of how long your window is or how fine-grained your buckets are.</p>
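<p>To make the O(1) trick concrete, here’s a hedged Python sketch of its most collapsed form, where the buckets degenerate into just two counters: the previous window’s total (weighted by how much of it still overlaps the rolling window) plus the current window’s running count:</p>
<pre><code class="lang-python">import time

class SlidingWindowCounter:
    """Approximate `limit` per `window` seconds using only two counters."""
    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_window = -1
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -&gt; bool:
        now = time.time()
        window = int(now // self.window)
        if window != self.current_window:
            # Slide: current becomes previous (zero if whole windows were skipped)
            self.previous_count = (
                self.current_count if window == self.current_window + 1 else 0
            )
            self.current_count = 0
            self.current_window = window
        elapsed = (now % self.window) / self.window  # fraction into current window
        # Previous window only counts for the part still inside the rolling window
        weighted = self.previous_count * (1 - elapsed) + self.current_count
        if weighted &lt; self.limit:
            self.current_count += 1
            return True
        return False
</code></pre>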
<h2 id="heading-pros-4">Pros:</h2>
<ol>
<li><p>Much lighter than Sliding Log: no per-request timestamps, just bucket counters.</p>
</li>
<li><p>Fairer than Fixed Window: smooths out boundary spikes by blending across buckets.</p>
</li>
<li><p>O(1) runtime per request, so it scales very well.</p>
</li>
</ol>
<h2 id="heading-cons-4">Cons:</h2>
<ol>
<li><p>Approximate, not exact: counts are close but not perfect (depending on bucket size).</p>
</li>
<li><p>Complexity: the implementation is trickier than Fixed Window or Token Bucket.</p>
</li>
</ol>
<h1 id="heading-summary">Summary</h1>
<p>Rate limiting is essential in modern applications, and the algorithm you choose should depend on your business model and what you expect your limiter to achieve. Whether you need strict fairness, lightweight performance, or burst tolerance, there’s a strategy that fits. Choose wisely — and happy coding!</p>
]]></content:encoded></item><item><title><![CDATA[Dump and Restore FOR PROCESSES?]]></title><description><![CDATA[Introduction
What’s going on everyone! I stumbled upon a very interesting project & have been playing around with it for the past couple of days, we all know about dumping and restoring data right? whether its databases or even raw files. However did...]]></description><link>https://hewi.blog/dump-and-restore-for-processes</link><guid isPermaLink="true">https://hewi.blog/dump-and-restore-for-processes</guid><category><![CDATA[Linux]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Devops]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Fri, 27 Jun 2025 15:05:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751036663905/c20595e1-414e-46ef-b24e-db12987b90d4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>What’s going on everyone! I stumbled upon a very interesting project &amp; have been playing around with it for the past couple of days. We all know about dumping and restoring data, right? Whether it’s databases or even raw files. But did you know there’s a way to checkpoint a running process and restore it later on? 🤯</p>
<p>Imagine a long-running task that we want to move to a different VM. We could simply pause its execution, saving every single detail of the process (instruction pointers, stack pointers, all of its memory, etc.), and restore it so it continues as if nothing happened. Here’s where <a target="_blank" href="https://github.com/checkpoint-restore/criu">CRIU</a> shines.</p>
<h1 id="heading-criu">CRIU</h1>
<p><strong>CRIU</strong> (Checkpoint and Restore In Userspace) lets you <strong>freeze a running Linux process</strong>, save its entire state to disk, and <strong>restore it later</strong> — like nothing ever happened.</p>
<p>It works mostly in <strong>userspace</strong>, and supports complex features like open files, memory, TCP connections, and more.</p>
<p>🔗 <a target="_blank" href="http://criu.org/">criu.org</a> has all the docs and examples to get started.</p>
<h1 id="heading-demo">Demo</h1>
<p>I wanted to try something using this tool, and I’ll demo it here. Imagine a web server that takes in a request and needs some time to process it. Will dumping &amp; restoring while the request is processing still work &amp; actually return the response? The answer is yes, but let’s go into detail on what exactly happens to achieve this.</p>
<h2 id="heading-installing-criu">Installing CRIU</h2>
<p>CRIU <strong>doesn’t work on macOS</strong> because it relies on <strong>Linux kernel features</strong> that <strong>macOS doesn’t have</strong> — and likely never will.</p>
<p>However, I just spun up an Ubuntu VM on DigitalOcean to get this demo done.</p>
<p>CRIU supports Ubuntu up to 22.04; anything after that isn’t supported yet. If you have an Ubuntu version above 22.04 it won’t work.</p>
<p>We can use <code>apt</code> package manager to install it as follows</p>
<pre><code class="lang-bash">sudo add-apt-repository ppa:criu/ppa
sudo apt update
sudo apt install criu
</code></pre>
<p>Once installed we can verify using <code>criu --version</code></p>
<h2 id="heading-simple-counter-program-to-test-the-commands">Simple counter program to test the commands</h2>
<p>Before moving on to the web server thing, I wrote a bash script that basically prints counts &amp; sleeps between each iteration.</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

i=1
<span class="hljs-keyword">while</span> <span class="hljs-literal">true</span>; <span class="hljs-keyword">do</span>
  <span class="hljs-built_in">echo</span> <span class="hljs-string">"Count: <span class="hljs-variable">$i</span>"</span> &gt;&gt; /tmp/count.log
  sleep 2
  ((i++))
<span class="hljs-keyword">done</span>
</code></pre>
<p>Run this via <code>./count.sh &amp;</code>; the trailing <code>&amp;</code> makes it run in the background.</p>
<p>If we tail <code>/tmp/count.log</code> we can see that it prints counts</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751033270735/ca7cee37-130d-434e-993a-2e654d56e624.gif" alt class="image--center mx-auto" /></p>
<p>Now let’s try the command to checkpoint from CRIU</p>
<p>First we get the PID via <code>pgrep -f count</code> → <code>11786</code></p>
<p>Then <code>sudo criu dump -t 11786 -D /tmp/checkpoint --shell-job</code></p>
<p>Make sure a directory exists at <code>/tmp/checkpoint</code></p>
<p>The command above will checkpoint &amp; save the process status as files in the directory <code>/tmp/checkpoint</code></p>
<p>It will freeze, checkpoint &amp; kill the process.</p>
<p>The <code>--shell-job</code> flag is used here because it allows CRIU to checkpoint and restore processes that:</p>
<ul>
<li><p>Are <strong>attached to a terminal</strong></p>
</li>
<li><p>Were started from a <strong>shell</strong></p>
</li>
<li><p>Have a <strong>controlling terminal</strong></p>
</li>
</ul>
<p>This is to do with detaching it from the terminal’s process &amp; session groups so it doesn’t forward any signals to the process.</p>
<p>Now the process stops completely; in fact, it gets killed. But we can restore it using:</p>
<p><code>sudo criu restore -t 11786 -D /tmp/checkpoint --shell-job</code></p>
<p>On restore, the count picks back up again. However, if you restore multiple times it will always start from the checkpoint taken the first time (at count 50, for example), even if the restored process has since counted further; to capture new progress you have to dump again.</p>
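<p>If you end up doing this a lot, the two commands are easy to wrap. A hedged sketch with hypothetical helper names, mirroring the exact flags used above (assumes sudo rights and an existing images directory):</p>
<pre><code class="lang-python">import subprocess

def checkpoint(pid: int, images_dir: str = "/tmp/checkpoint") -&gt; None:
    """Freeze, dump and kill the process -- same as the manual dump command."""
    subprocess.run(
        ["sudo", "criu", "dump", "-t", str(pid), "-D", images_dir, "--shell-job"],
        check=True,
    )

def restore(pid: int, images_dir: str = "/tmp/checkpoint") -&gt; None:
    """Bring the process back from the saved images."""
    subprocess.run(
        ["sudo", "criu", "restore", "-t", str(pid), "-D", images_dir, "--shell-job"],
        check=True,
    )
</code></pre>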
<p>Now let’s create a simple Python web server that listens for a request and takes 10 seconds to process it, and see what happens 👀</p>
<h2 id="heading-python-webserver">Python Webserver</h2>
<p>First install python3</p>
<p><code>sudo apt install python3 python3-pip</code></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> http.server <span class="hljs-keyword">import</span> BaseHTTPRequestHandler, HTTPServer
<span class="hljs-keyword">import</span> time

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DelayedHandler</span>(<span class="hljs-params">BaseHTTPRequestHandler</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">do_GET</span>(<span class="hljs-params">self</span>):</span>
        time.sleep(<span class="hljs-number">10</span>)
        self.send_response(<span class="hljs-number">200</span>)
        self.end_headers()
        self.wfile.write(<span class="hljs-string">b"Done after checkpoint!"</span>)

server = HTTPServer((<span class="hljs-string">'0.0.0.0'</span>, <span class="hljs-number">8080</span>), DelayedHandler)
print(<span class="hljs-string">"Starting server on port 8080..."</span>)
server.serve_forever()
</code></pre>
<p>Run using <code>python3</code> <a target="_blank" href="http://webserver.py"><code>webserver.py</code></a> <code>&amp;</code></p>
<p>Let’s check if it’s working using curl:</p>
<pre><code class="lang-python">curl  http://&lt;VM-IP&gt;:<span class="hljs-number">8080</span>
Done after checkpoint!%
</code></pre>
<p>Now let’s run a request, dump &amp; restore mid-request, and see what happens</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1751035392863/0c64a2e5-f810-43fd-96d1-e44ec183c72d.gif" alt class="image--center mx-auto" /></p>
<p>We can dump &amp; restore while in the middle of a request! When dumping, the <code>--tcp-established</code> flag makes sure everything related to the TCP sockets in the process is preserved. It pauses them, and the client has no idea what is happening.</p>
<p>This is easy on the same host because the IP address stays the same, of course, but across two different hosts we need to make sure the addresses resolve to the new host, otherwise it will fail.</p>
<h1 id="heading-summary">Summary</h1>
<p>This opens up room for endless ideas. Docker, for example, uses CRIU for container migration, where you’d want to move a container from one place to another. This has been a small but dense article; hope you enjoyed it &amp; see you in the next one!</p>
]]></content:encoded></item><item><title><![CDATA[Diving into the DevOps world: How cool is log rotating?]]></title><description><![CDATA[Introduction
What is happening everyone! Hope you’re all having an amazing start of the summer. In this article i’m diving into something I recently spent time reading about and was fascinated about how cool & useful it is.
In systems that produce lo...]]></description><link>https://hewi.blog/diving-into-the-devops-world-how-cool-is-log-rotating</link><guid isPermaLink="true">https://hewi.blog/diving-into-the-devops-world-how-cool-is-log-rotating</guid><category><![CDATA[Devops]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sun, 08 Jun 2025 12:40:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749386416723/47bc18d7-905a-4d72-b847-cb223d72caba.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>What is happening everyone! Hope you’re all having an amazing start to the summer. In this article I’m diving into something I recently spent time reading about and was fascinated by how cool &amp; useful it is.</p>
<p>In systems that produce logs, whether it’s a web server, database, background job, etc., the logs tend, over time, to grow in size. For example, a web server that writes to a log file on every request will see that file grow linearly as its users grow.</p>
<p>Over time this significantly eats into disk space, which can eventually make the whole system slower (peep the next article to know why).</p>
<p>The disk needed a hero. Enter log rotation.</p>
<p>Log rotation controls this whole process and prevents log files from growing unbounded (it makes everything more controllable).</p>
<p>You can fully control how the logs get rotated: either every fixed period of time or once the file reaches a certain size.</p>
<p>Older log files can then be compressed to reduce their size &amp; maybe sent to some archive to free up disk space. You can also control how many rotated files to keep before the oldest get deleted. So to summarize:</p>
<p><strong>Log rotation is a process or strategy</strong> that:</p>
<ul>
<li><p>Limits log file size or age</p>
</li>
<li><p>Archives old logs (optionally compressing them)</p>
</li>
<li><p>Starts fresh logs for continued logging</p>
</li>
<li><p>Enforces a retention policy (e.g. keep last 7 logs)</p>
</li>
</ul>
<p>Now that we understand what rotation is, let’s look at one of the most famous tools used for log rotation</p>
<h1 id="heading-logrotate">LogRotate</h1>
<p>Logrotate is one of the most popular and widely used log rotation tools, especially on <strong>Linux systems</strong>. It’s the de facto standard for rotating log files created by system services, daemons, and applications.</p>
<p>We’ll dive into a quick demo rotating logs for a service that produces logs nonstop</p>
<ul>
<li><p>To install on Mac <code>brew install logrotate</code></p>
</li>
<li><p>Ubuntu: <code>sudo apt install logrotate</code></p>
</li>
</ul>
<p>Once installed, the main entry point for the binary is the <code>logrotate.conf</code> file. On Mac it’s located at <code>/opt/homebrew/etc/logrotate.conf</code>.</p>
<p>If we take a look at it we’ll find the following</p>
<pre><code class="lang-bash"><span class="hljs-comment"># see "man logrotate" for details</span>

<span class="hljs-comment"># global options do not affect preceding include directives</span>

<span class="hljs-comment"># rotate log files weekly</span>
weekly

<span class="hljs-comment"># keep 4 weeks worth of backlogs</span>
rotate 4

<span class="hljs-comment"># create new (empty) log files after rotating old ones</span>
create

<span class="hljs-comment"># uncomment this if you want your log files compressed</span>
<span class="hljs-comment">#compress</span>

<span class="hljs-comment"># packages drop log rotation information into this directory</span>
include /opt/homebrew/etc/logrotate.d

<span class="hljs-comment"># system-specific logs may also be configured here.</span>
</code></pre>
<ul>
<li><p><code>weekly</code> rotates log files every week (not by itself: a cron job is required to run the logrotate binary, which then checks whether a week has passed and rotates). Different options exist: <code>daily</code>, <code>hourly</code> &amp; more</p>
</li>
<li><p><code>rotate 4</code> keeps the last 4 rotations and deletes the rest.</p>
</li>
<li><p><code>create</code> creates new log files after rotating. Be careful when using this, because any process dumping logs holds an open file descriptor to the old log file; when a new one is created, you have to make the process reopen it to point at the new file (see the sketch after this list). <code>copytruncate</code> is a lifesaver here because it copies the logs to a new file and truncates the original in place, meaning the file descriptor still points at the same file.</p>
</li>
<li><p><code>compress</code> uses gzip to compress the rotated files to reduce size</p>
</li>
<li><p><code>include /opt/homebrew/etc/logrotate.d</code> basically is for application management where in the <code>logrotate.d</code> directory you can create different configurations for different applications each customized with their own set of settings. (We’ll dive into more below)</p>
</li>
<li><p>Example of a simple app I created <code>/opt/homebrew/etc/logrotate.d/myapp</code></p>
<pre><code class="lang-bash">  /var/<span class="hljs-built_in">log</span>/myapp/*.<span class="hljs-built_in">log</span> <span class="hljs-comment"># specify the path of the logs to operate on</span>

  {
  size 5k <span class="hljs-comment"># Rotate when size is 5 kilobytes, K is kilobytes, M is megabytes, etc (was testing that's why its low) </span>
  copytruncate <span class="hljs-comment"># copytruncate basically instead of creating a new file copies the data from the file to another file then empties it</span>
  compress <span class="hljs-comment"># gzip old logs</span>
  rotate 3 <span class="hljs-comment"># keep last 3 rotations</span>
  missingok <span class="hljs-comment"># if no logs are found don't panic</span>
  notifempty <span class="hljs-comment"># don't rotate if the log is empty.</span>
  }
</code></pre>
</li>
</ul>
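<p>To see why <code>create</code> bites, here’s a small Python experiment demonstrating that an open file descriptor follows the inode, not the path: rename the file and the writer keeps writing to the renamed one.</p>
<pre><code class="lang-python">import os

f = open("/tmp/demo.log", "a")  # the "service" holding an open fd
f.write("before rotation\n")
f.flush()

os.rename("/tmp/demo.log", "/tmp/demo.log.1")  # what a naive rotation does

f.write("after rotation\n")  # still lands in demo.log.1!
f.flush()

print(open("/tmp/demo.log.1").read())  # both lines are here
# A fresh /tmp/demo.log receives nothing until the writer reopens the path,
# which is exactly the surprise copytruncate avoids.
</code></pre>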
<p>For all the configurations you can visit the man page for log rotate <a target="_blank" href="https://man7.org/linux/man-pages/man8/logrotate.8.html#:~:text=size%20size%20Log%20files%20are,size%20100G%20are%20all%20valid.">here</a></p>
<p>But with the setup above I have a simple script that writes logs</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

logfile=<span class="hljs-string">"/var/log/myapp/production.log"</span>

<span class="hljs-comment"># Function to generate a random log record</span>
<span class="hljs-function"><span class="hljs-title">generate_log_record</span></span>() {
    <span class="hljs-built_in">local</span> loglevel=(<span class="hljs-string">"INFO"</span> <span class="hljs-string">"WARNING"</span> <span class="hljs-string">"ERROR"</span>)
    <span class="hljs-built_in">local</span> services=(<span class="hljs-string">"web"</span> <span class="hljs-string">"database"</span> <span class="hljs-string">"app"</span> <span class="hljs-string">"network"</span>)
    <span class="hljs-built_in">local</span> timestamps=$(date +<span class="hljs-string">"%Y-%m-%d %H:%M:%S"</span>)
    <span class="hljs-built_in">local</span> random_level=<span class="hljs-variable">${loglevel[$RANDOM % ${#loglevel[@]}</span>]}
    <span class="hljs-built_in">local</span> random_service=<span class="hljs-variable">${services[$RANDOM % ${#services[@]}</span>]}
    <span class="hljs-built_in">local</span> message=<span class="hljs-string">"This is a sample log record for <span class="hljs-variable">${random_service}</span> service."</span>

    <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">${timestamps}</span> [<span class="hljs-variable">${random_level}</span>] <span class="hljs-variable">${message}</span>"</span>
}

<span class="hljs-comment"># Main loop to write log records every second</span>
<span class="hljs-keyword">while</span> <span class="hljs-literal">true</span>; <span class="hljs-keyword">do</span>
    log_record=$(generate_log_record)
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">${log_record}</span>"</span> &gt;&gt; <span class="hljs-string">"<span class="hljs-variable">${logfile}</span>"</span>
    sleep 1
<span class="hljs-keyword">done</span>
</code></pre>
<p>Now we have:</p>
<ol>
<li><p>Log rotate setup for our application</p>
</li>
<li><p>Logs being created</p>
</li>
</ol>
<p>But we need a way to actually invoke the <code>logrotate</code> command, because it doesn’t run on its own, and here we have two options:</p>
<ul>
<li><p>A cron job that runs every specified period</p>
</li>
<li><p>Using a file watcher that watches file sizes and acts accordingly</p>
</li>
</ul>
<p>In my case, since I added the <code>size</code> config, I’ll use a file watcher. On Mac I used <a target="_blank" href="https://emcrisostomo.github.io/fswatch/"><code>fswatch</code></a></p>
<pre><code class="lang-bash">fswatch -0 /var/<span class="hljs-built_in">log</span>/myapp/production.log | <span class="hljs-keyword">while</span> <span class="hljs-built_in">read</span> -d <span class="hljs-string">""</span> event
<span class="hljs-keyword">do</span>
  size=$(<span class="hljs-built_in">stat</span> -f%z /var/<span class="hljs-built_in">log</span>/myapp/production.log)
  <span class="hljs-keyword">if</span> [ <span class="hljs-string">"<span class="hljs-variable">$size</span>"</span> -ge 5120 ]; <span class="hljs-keyword">then</span>
    sudo logrotate -f /opt/homebrew/etc/logrotate.conf
  <span class="hljs-keyword">fi</span>
<span class="hljs-keyword">done</span>
</code></pre>
<p>Basically, watch the size of <code>production.log</code>; once it’s more than 5 KB (small for testing) we rotate.</p>
<p>We can watch it in action:</p>
<p>Watch how <code>production.log</code> reaches 5 KB and gets rotated while the oldest archive, <code>production.log.3.gz</code>, gets deleted</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749385674956/8afef282-ab06-4986-9bd7-32f7eb419de0.gif" alt class="image--center mx-auto" /></p>
<p>You can do so much more with rotations: add scripts that run on every rotate, upload the rotated files to some cloud storage, etc. Endless ideas can be built here.</p>
<h1 id="heading-summary">Summary</h1>
<p>Rotation is cool.</p>
]]></content:encoded></item><item><title><![CDATA[Finding a Needle in a Haystack: How to Diff 800M+ Records Across Two Databases Without Losing Your Mind]]></title><description><![CDATA[Introduction
Hello guys! This is going to be a quick but interesting one. In design sometimes for faster response times & aggregation purposes we step away from the traditional relational databases and head over towards more analytical processing opt...]]></description><link>https://hewi.blog/finding-a-needle-in-a-haystack-how-to-diff-800m-records-across-two-databases-without-losing-your-mind</link><guid isPermaLink="true">https://hewi.blog/finding-a-needle-in-a-haystack-how-to-diff-800m-records-across-two-databases-without-losing-your-mind</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software development]]></category><category><![CDATA[ClickHouse]]></category><category><![CDATA[SQL]]></category><category><![CDATA[scalability]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 10 May 2025 12:03:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746878558524/33910067-857f-4281-8e3c-e027eca04e9c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hello guys! This is going to be a quick but interesting one. Sometimes, for faster response times &amp; aggregation purposes, we step away from traditional relational databases and head towards databases optimized for analytical processing (e.g. ClickHouse, which is a columnar data store).</p>
<p>Maintaining a source-of-truth database is important, and that will most probably be the relational one. We opt into syncing the two either synchronously or asynchronously, where data in Postgres, for example, is always passed over to ClickHouse, which we can read directly for aggregates and such.</p>
<p>I’m not here to talk about how to sync them together; the main goal is simply that both databases stay in sync, holding the same exact data either right now or eventually (but that’s a different story).</p>
<p>Sometimes they’ll deviate, meaning Postgres has more data than ClickHouse. That can happen because some messages weren’t processed as they should have been, network errors, etc.</p>
<p>Now you’ve figured out the reason and pushed a fix, but you want to re-sync the missing parts, and the next question arises:</p>
<blockquote>
<p>How do we sync the missing data again?</p>
</blockquote>
<p>The answer depends on the scale of the data involved. And let me tell you from experience: that’s the most important factor in answering the question.</p>
<p>I’ll explain the next parts in a series of questions and answers and hopefully the answers help you reach a conclusion by the end of the article</p>
<h1 id="heading-how-big-is-the-data">How big is the data?</h1>
<p>Data sizes can range from a few hundred thousand rows to a few million to almost billions of rows.</p>
<p>Smaller ranges have more options than larger ones: it’s easier and takes less time to resync when you don’t have a lot of rows.</p>
<p>Larger ranges are where things get tough. You’ll reach a point where you start questioning yourself but hey everything is a learning curve I guess.</p>
<p>After figuring this out we opt into asking the second question</p>
<h1 id="heading-how-big-is-the-delta">How big is the delta?</h1>
<p>The difference between the two databases, and the ratio of that delta to the total data size, is important to know.</p>
<p>Let me tell you before moving on that having a difference of 4000 records in a data size of 1b+ rows is something that’ll cause you headaches. This article is more aimed at that scale of data.</p>
<p>Now knowing the answers of the above two questions can already derive you to a solution</p>
<h1 id="heading-small-ish-data-sizes">Small-ish data sizes</h1>
<p>Small amounts of data (up to a few hundred thousand rows) can be easily re-synced. The most straightforward option is to fetch the ids (any unique key, really) from both sides and just compute the difference. It will take some memory, but not a crazy amount. Once you have the delta, just re-insert the missing rows into your analytical database and move on with your day.</p>
<h1 id="heading-larger-data-sizes">Larger data sizes</h1>
<p>Here is where the above solution will NEVER work under normal circumstances (unless you have a whopping 100gb RAM machine or something)</p>
<p>I’m going to stick with the 4000-records-out-of-800M+ problem here (we love extremes). The solutions provided for the questions below are the steps I took to actually solve this problem.</p>
<p>Now you’re going to have to play it carefully and ask new questions, and these questions depend on your database setup and how optimized it is.</p>
<p>The setup we’ll assume is that there’s no partitioning, no sharding, just indexes and vibes (not the best setup for this amount of data).</p>
<p>The main goal here is to stay away from memory. And by staying away from memory I mean any application level logic you attempt will end up failing miserably.</p>
<p><strong>Some questions to ask</strong></p>
<ol>
<li>Is there a way to narrow the search window? For example, do I have to search from the first record to the last, or can I close that gap a little bit? Indexes will help, but it still takes a long time.</li>
</ol>
<p>For example if I figured out that the 4000 records got dropped only in the past month then I can utilize some timestamp (hopefully indexed) and work around that for a start.</p>
<p>So the first task is to <strong>minimize the search field as much as you can</strong></p>
<ol start="2">
<li>What are the strengths of each database? Knowing this can help us make better decisions</li>
</ol>
<p>a- For example, ClickHouse is very strong at aggregating and crunching numbers due to its columnar nature (data is physically stored together by column, not by row). Knowing this, querying the ids over the date range specified in step 1 is quick (a few-seconds query), but instead of pulling them into memory we write them to a <code>csv</code> file on disk. Something like this:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Clickhouse query</span>
SELECT id 
FROM xxx 
WHERE xxx_date BETWEEN <span class="hljs-string">'2025-01-01'</span> AND <span class="hljs-string">'2025-01-31'</span> 
INTO OUTFILE <span class="hljs-string">'clickhouse_ids.csv'</span>
FORMAT CSV;
</code></pre>
<p>This will output all the ids that exist in that date range. They’re missing the 4000 records, but this is a solid step towards solving the problem.</p>
<p>b- Now in Postgres the goal is to take the ids from above and exclude them, finding the missing 4000.</p>
<p>Here comes a really cool trick I learned: creating a <strong>temporary Postgres table</strong>.</p>
<blockquote>
<p>Temporary tables are tables that get removed once the connection/session is closed: for example, create a temp table inside psql and it gets deleted on exit.</p>
</blockquote>
<p>The idea here is to create this table, dump the ids into it, and use it to query against the larger table. The memory overhead is <strong>offloaded to Postgres’ shared buffers, disk, and planner</strong> instead of the application.</p>
<p>Even if the table is big, Postgres manages the spillover efficiently, avoiding app memory pressure.</p>
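<p>If you’d rather script the load than paste commands into psql, here’s a hedged sketch using psycopg2 (an assumed driver; the DSN is hypothetical). It creates the temp table and bulk-loads the ClickHouse CSV; since temp tables vanish with the session, the diff query below must run on this same connection:</p>
<pre><code class="lang-python">import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Session-scoped table: gone when the connection closes
    cur.execute("CREATE TEMP TABLE clickhouse_ids (id BIGINT PRIMARY KEY)")
    # Bulk-load the ClickHouse export straight from disk via COPY
    with open("clickhouse_ids.csv") as f:
        cur.copy_expert("COPY clickhouse_ids FROM STDIN WITH CSV", f)
    # ...then run the anti-join below on this same cursor
</code></pre>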
<p>Now executing something like this will be ideal</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Postgres psql</span>
\copy (
  SELECT p.id
  FROM xx p
  WHERE p.xx_date BETWEEN <span class="hljs-string">'2025-01-01'</span> AND <span class="hljs-string">'2025-01-31'</span>
  AND p.id NOT IN (
    SELECT id FROM clickhouse_ids
  )
) TO <span class="hljs-string">'missing_ids.csv'</span> WITH CSV
</code></pre>
<p>Given that we’ve properly indexed the table on the <code>date column</code> and the <code>id</code>, everything falls into the hands of the planner anyway.</p>
<p>Once this query finishes, you’ll have a 4000-row CSV of the missing ids. All that’s left is inserting them into ClickHouse.</p>
<h1 id="heading-summary">Summary</h1>
<p>Dealing with large data is definitely an experience. Application-level solutions have no power here, and thinking outside the box is definitely a must. This was an approach I personally used that worked better than I expected, so I thought I’d share. Thank you guys for tuning in &amp; see you in the next one!</p>
]]></content:encoded></item><item><title><![CDATA[Migrating Data to AWS, lessons learned]]></title><description><![CDATA[Introduction
Hey everyone, Recently i’ve been working on migrating 1.5TB PostgreSQL worth of data to AWS from another cloud provider. I wanted to document the journey on the different attempts that were made and what did not work and what did. This i...]]></description><link>https://hewi.blog/migrating-data-to-aws-lessons-learned</link><guid isPermaLink="true">https://hewi.blog/migrating-data-to-aws-lessons-learned</guid><category><![CDATA[Databases]]></category><category><![CDATA[AWS]]></category><category><![CDATA[migration]]></category><category><![CDATA[#dms]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Fri, 25 Apr 2025 12:58:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745585838257/82b1f4c8-b037-4524-8e21-0badac562ae1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hey everyone, recently I’ve been working on migrating 1.5TB worth of PostgreSQL data to AWS from another cloud provider. I wanted to document the journey: the different attempts that were made, what didn’t work and what did. This is going to be a short one because honestly I’m out of ideas at this moment in time, so forgive me 😅</p>
<h1 id="heading-context">Context</h1>
<p>The goal was to move everything to AWS, so part of that was essentially migrating the database.</p>
<h1 id="heading-using-aws-dms">Using AWS DMS 💀</h1>
<p>The first attempt was to migrate the data using AWS’s data migration tool, and let me tell you, the mistake that happened before everything else is that we should have spiked it out for a couple of days before making this decision.</p>
<p>We started moving data using the DMS tool, so basically just migrating data from one place to another. At first we had no idea it doesn’t copy the schema, so prepare yourself for the first headache.</p>
<h2 id="heading-headache-1-theres-no-schema">Headache 1: There’s no schema 💀</h2>
<p>After everything finished we tried it out, and it had the data, but we couldn’t create anything new (no sequences defined for ids); there were no indexes, no constraints, nothing. Then we realized the DMS tool doesn’t copy schema (wish I had read about that beforehand lol). Anyways, continuing the story, we decided to take a schema-only dump (<code>--schema-only</code>) using <code>pg_dump</code> and apply it on the new db. Aaaand we ran into another problem.</p>
<h2 id="heading-headache-2-default-values-for-columns">Headache 2: Default values for columns</h2>
<p>Turns out the DMS tool creates the bare minimum schema it needs to transfer the data (gotta have tables to be able to insert into them lol), and when it does this it skips the default values defined on columns. When we apply the schema, the CREATE TABLE commands (with correct default values) don’t run because the tables already exist. So indexes, sequences, constraints and triggers work, but the default values are gone.</p>
<p>Unless you decide to go through every single table and write ALTER statements to add the missing defaults, which really isn’t best practice if you think about it.</p>
<p>This problem spiraled into us trying different techniques to make it work, some of them were the following:</p>
<ol>
<li><p>Say screw it, apply the schema first, and accept a slower but integrity-preserving load.</p>
</li>
<li><p>Add the schema but disable indexes then enable afterwards</p>
</li>
<li><p>Try <code>pg_dump</code> using a <code>.sql</code> format, a <code>.dump</code> format, I don’t even remember the rest tbf</p>
</li>
</ol>
<p>And congrats guys we reached our 3rd, 4th and 5th headache of the day</p>
<h2 id="heading-headache-3-invalid-command-galore-in-sql-dumps">Headache 3: Invalid command galore in .sql dumps</h2>
<p>So we took a <code>.sql</code> dump, tried it, and well, a spam of <code>invalid command \N</code> errors started to appear, meaning the file couldn’t be parsed correctly when attempting to restore. I couldn’t care less at this point and decided to just try the <code>.dump</code> format to see if the error persisted, and it went away completely. But another one appeared (surprise).</p>
<h2 id="heading-headache-4-pgrestore-doesnt-work-if-you-have-generated-columns">Headache 4: pg_restore doesn’t work if you have generated columns</h2>
<p>A generated column in PostgreSQL is a column that is physically stored on disk but is generated from other columns in the table, something like this</p>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> users (
  first_name <span class="hljs-type">TEXT</span>,
  last_name <span class="hljs-type">TEXT</span>,
  full_name <span class="hljs-type">TEXT</span> <span class="hljs-keyword">GENERATED</span> <span class="hljs-keyword">ALWAYS</span> <span class="hljs-keyword">AS</span> (first_name || <span class="hljs-string">' '</span> || last_name) STORED
);
</code></pre>
<p>And apparently, when restoring the <code>.dump</code> using <code>pg_restore</code>, it worked for all the tables that don’t have generated columns, but for the ones that did:</p>
<pre><code class="lang-pgsql">pg_restore: error: could <span class="hljs-keyword">not</span> <span class="hljs-keyword">execute</span> query: ERROR:  <span class="hljs-keyword">column</span> "xx" <span class="hljs-keyword">is</span> a <span class="hljs-keyword">generated</span> <span class="hljs-keyword">column</span>
DETAIL:  <span class="hljs-keyword">Generated</span> <span class="hljs-keyword">columns</span> cannot be used <span class="hljs-keyword">in</span> <span class="hljs-keyword">COPY</span>.
Command was: <span class="hljs-keyword">COPY</span> foo (id, <span class="hljs-type">name</span>, ..., xx, ...) <span class="hljs-keyword">FROM stdin</span>;
</code></pre>
<p>And the COPY command fails completely. Turns out you can’t COPY generated columns with <code>pg_restore</code>.</p>
<p>We thought of dropping the columns, transferring the data, and re-adding the columns, but that didn’t work either.</p>
<h1 id="heading-what-ended-up-happening">What ended up happening</h1>
<p>We decided to go through with a <code>.dump</code> of the whole database &amp; schema, optimizing the database as much as we could for insertions and parallelizing the restore with <code>pg_restore</code>.</p>
<p>Some of the optimizations that were made were: (all in postgresql.conf file)</p>
<ol>
<li><p>Bumped shared buffers</p>
</li>
<li><p>Bumped <code>maintenance_work_mem</code> (Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY.)</p>
</li>
<li><p>Bumped <code>checkpoint_timeout</code>, which controls how often Postgres syncs the actual data files with what the WAL has committed (it goes deeper, but that’s the tl;dr)</p>
</li>
<li><p>Disabled <code>fsync</code> and <code>synchronous_commit</code>, which is <strong>not production friendly</strong>; this basically skips flushing the WAL to disk on every write (sometimes writes get batched when it’s on, but that’s beside the point)</p>
</li>
<li><p>Disabled <code>full_page_writes</code>, which is not production friendly either. Normally Postgres writes a full copy of each page to the WAL the first time it’s modified after a checkpoint (for crash-recovery safety); turning this off logs only the partial change.</p>
<p> This makes WAL much <strong>smaller</strong> and <strong>writes faster</strong>.</p>
</li>
</ol>
<p>That’s it I guess, needed to rant about this so I wrote it as an article.</p>
<p>Thanks for coming to my tech talk guys and hope you enjoyed! till the next one</p>
]]></content:encoded></item><item><title><![CDATA[Building a scalable top K using Kafka & Flink]]></title><description><![CDATA[Introduction
What’s happening everyone! In today’s article we’re going to be diving deep into creating a scalable top K list for the most liked videos in a configurable time. Top k questions are widely used in system design interviews and in real lif...]]></description><link>https://hewi.blog/building-a-scalable-top-k-using-kafka-and-flink</link><guid isPermaLink="true">https://hewi.blog/building-a-scalable-top-k-using-kafka-and-flink</guid><category><![CDATA[kafka]]></category><category><![CDATA[apache-flink]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data structures]]></category><category><![CDATA[Databases]]></category><category><![CDATA[Startups]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Wed, 02 Apr 2025 12:36:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743597278141/7e6b1692-9fe6-4342-9cb0-518d45abad0d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>What’s happening everyone! In today’s article we’re going to be diving deep into creating a scalable top K list for the most liked videos in a configurable time. Top k questions are widely used in system design interviews and in real life as they provide really valuable insights depending on what domain they are used in.</p>
<p>We’re going to be going through a brief explanation of how Flink works, what the aim of the article is and we’re going to code everything up (I’ll leave the link for the GitHub repo at the end of the article). Let’s begin.</p>
<p>Before moving forward we’re going to imagine a scenario where we were tasked with engineering a top 5 most liked videos problem which moves us to the project requirements section.</p>
<h1 id="heading-project-requirements">Project Requirements</h1>
<p>We’ve been asked by the business team that for insights the following is required:</p>
<ol>
<li><p>View latest Top 5 most liked videos on our platform (Last 5 minutes for example, refreshes every minute)</p>
<ul>
<li>We have a lot of likes coming in per second (around 200~400k/sec)</li>
</ul>
</li>
<li><p>Display them in some fancy frontend</p>
</li>
<li><p>It’s okay if data skews a couple of seconds maximum</p>
</li>
</ol>
<p><strong>Out of Scope for now:</strong></p>
<ol>
<li>Keeping track of historical data for periods of time (we just want the live right now)</li>
</ol>
<p>Now before moving into how we’ll design this thing, let me talk briefly about Flink</p>
<h1 id="heading-flink-brief-intro">Flink Brief Intro</h1>
<p>I’ve written an article on Flink before, if you’re interested in the deep dives: <a target="_blank" href="https://hewi.blog/white-paper-summaries-apache-flink">here</a></p>
<p>But for now I’ll give you what you need to know for this article.</p>
<h2 id="heading-what-is-flink">What is Flink?</h2>
<p>Apache Flink is an open-source system for processing streaming and batch data. It’s highly scalable and fault tolerant and superior in handling massive amounts of data. It’s a great tool for real-time analytics and continuous streams.</p>
<h2 id="heading-how-does-flink-work-very-so-much-simplified">How does Flink work? (Very so much simplified)</h2>
<p>Every Flink job has what’s called a <strong>Job manager</strong>. The job manager can have many <strong>task managers</strong>, and the task managers have what are called <strong>slots</strong>. Every slot can execute some stage of the pipeline, so a slot is basically a unit of execution. It looks something like this (very much abstracted):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743591790721/535cb92d-9d46-4c4f-b3b9-a793717ef1c9.png" alt class="image--center mx-auto" /></p>
<p><mark>Notice the arrows back and forth from Task managers as they communicate and can send and receive data from each other</mark></p>
<p>When you write some code (usually Java) and submit a Flink job (let’s say you want to aggregate counts of videos by their id), Flink takes your code and creates what is called a Dataflow Graph, which it then uses to know how to execute the given job.</p>
<p>For example aggregating some data stream and counting can look like this</p>
<pre><code class="lang-java">stream
  .keyBy(video -&gt; video.getId()) # Group by id
  .window(TumblingProcessingTimeWindows.of(Time.minutes(<span class="hljs-number">1</span>)))
  .sum(<span class="hljs-number">1</span>);
</code></pre>
<p>Behind the scenes, Flink translates this into a <strong>directed acyclic graph (DAG)</strong> where each node represents an operation (operator) — like <code>keyBy</code>, <code>window</code>, or <code>sum</code> — and the edges represent the data flowing between them.</p>
<p>This <strong>Dataflow Graph</strong> becomes the blueprint for execution. It helps Flink decide:</p>
<ul>
<li><p>How to <strong>parallelize</strong> the work (e.g., how many tasks should do the aggregation)</p>
</li>
<li><p>How to <strong>shuffle</strong> or <strong>partition</strong> the data (e.g., based on the key) (data flowing between task managers)</p>
</li>
<li><p>Where <strong>state</strong> needs to be managed (e.g., window state, accumulators) (in the example above a tumbling window every minute to the minute)</p>
</li>
<li><p>And how to <strong>recover</strong> from failures by tracking operator state and checkpoints</p>
</li>
</ul>
<p>This is what enables Flink to scale your job from running on your laptop to a 100-node cluster with minimal changes to your code.</p>
<p>The Flink Web UI even lets you inspect this graph visually, showing each operator and its parallelism, helping you understand exactly how your job flows end to end.</p>
<p>Scaling Flink is a whole different story. Depending on your scale, you’ll need to decide on things like the <strong>parallelism</strong> of your job, the <strong>number of TaskManagers</strong>, and the <strong>slots per TaskManager</strong>.</p>
<p>The <strong>parallelism</strong> defines how many parallel instances will be created for <strong>each operator</strong> in your job’s Dataflow Graph. Think of it like this: for every step in your pipeline (e.g., map, keyBy, window, etc.), Flink can spawn multiple subtasks to process data in parallel. The higher the parallelism, the more throughput your job can handle — assuming your hardware can keep up.</p>
<p>You can scale horizontally by:</p>
<ul>
<li><p>Increasing the <strong>number of TaskManagers</strong> (i.e., more JVMs on more machines)</p>
</li>
<li><p>Assigning more <strong>task slots per TaskManager</strong> (i.e., more threads to run subtasks)</p>
</li>
</ul>
<p>Flink’s job manager then maps the parallel subtasks of each operator onto these slots across TaskManagers, distributing the work evenly. If our stream source is Kafka, matching at least the topic’s partition count is a good starting point for parallelism.</p>
<p>Now that we got a high level on how Flink works, let’s talk about how we’ll design this thing</p>
<h1 id="heading-technical-design">Technical Design</h1>
<p>Let’s say we have 2 Kafka partitions and our parallelism is set to 2 (2 subtasks for every single operator/step). The goal is to aggregate the likes into counts by video id and find a way to keep only the top 5. How can we do that?</p>
<p>Well a popular data structure for calculating top K from a bunch of elements is a <a target="_blank" href="https://www.youtube.com/watch?v=wptevk0bshY"><mark>PriorityQueue</mark></a> (Min-heap) but how can we leverage it in our design?</p>
<p>We’re actively reading data like this from the 2 Kafka partitions: <code>{video_id: 12332}</code></p>
<p>We need to do the following:</p>
<ol>
<li><p>Aggregate (Count) all same video ids</p>
</li>
<li><p>Push them into a Priority Queue of size 5 (fixed size/memory)</p>
</li>
<li><p>Return the result</p>
</li>
</ol>
<p>However, the steps above are missing something: we have 2 subtasks for aggregating the counts. This means we have one of two approaches here:</p>
<ol>
<li><p>After aggregating push them all to a single node and generate a top k</p>
</li>
<li><p>Generate local top k’s for every part and then send them to a node to generate a global top k</p>
</li>
</ol>
<p>The first approach, while simpler, can overload the final destination node with a lot of data, which can eat up its memory. Imagine having millions of aggregated records sent to a single node.</p>
<p>The second approach is more <strong>efficient:</strong> we’ll generate a local top k for every aggregated part, send these over (only a fixed size of 5 per node), and generate a top K out of the received local top K’s</p>
<p>The design looks something like this (this is the technical thinking, not how Flink will physically execute it)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743593867081/f410d15d-489b-4874-a30c-d0154392f1d0.png" alt class="image--center mx-auto" /></p>
<p>Now since our requirements are not strict and we don’t need history, every output of the global top k will be sent to <strong>Redis,</strong> so the user can fetch it easily and display it. The key in Redis gets updated every minute with the latest 5-minute window in a sliding window pattern (the Flink approach we’ll talk about in aggregation), so the design we’ll go with finalizes to this</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743594020608/34830ee4-d0be-49b8-b9bf-da126f300bc4.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-stages-in-flink">Stages in Flink</h1>
<blockquote>
<p>Code here is highly abstracted; the repo link will be at the end of the article</p>
</blockquote>
<h2 id="heading-define-source">Define source</h2>
<p>In Flink it goes Source → some operations → sink (destination)</p>
<p>We need to define Kafka as our source of data and this can be done with something like this</p>
<pre><code class="lang-java">        props.setProperty(<span class="hljs-string">"bootstrap.servers"</span>, <span class="hljs-string">"kafka:9092"</span>);
        props.setProperty(<span class="hljs-string">"group.id"</span>, <span class="hljs-string">"video-likes-consumer"</span>);
        FlinkKafkaConsumer&lt;String&gt; consumer = <span class="hljs-keyword">new</span> FlinkKafkaConsumer&lt;&gt;(
                <span class="hljs-string">"video_likes"</span>,
                <span class="hljs-keyword">new</span> SimpleStringSchema(),
                props
        );

        DataStream&lt;String&gt; stream = env.addSource(consumer);
</code></pre>
<p>We’re now consuming from the topic <code>video_likes</code></p>
<h2 id="heading-steps-aggregation-to-top-k">Steps (Aggregation to Top K)</h2>
<p>Now, the best thing about Flink is the different APIs it offers: out-of-the-box APIs that do different types of aggregations. Since our requirement is the most liked videos over the latest 5 minutes, refreshing every minute, we can use a sliding window approach (e.g. 0-5, 1-6, 2-7, etc.): as the window moves, we get the latest 5-minute window only. Before doing that we would need to JSON-parse the stream output to extract the video id (view in repo). The important part here is the aggregation steps.</p>
<pre><code class="lang-java">        likes
        .keyBy(value -&gt; value.f0)
        .window(SlidingProcessingTimeWindows.of(Time.minutes(<span class="hljs-number">5</span>), Time.minutes(<span class="hljs-number">1</span>))) 
        .aggregate(<span class="hljs-keyword">new</span> LocalTopKAggregator(<span class="hljs-number">5</span>))
</code></pre>
<p>It starts off like this: <code>keyBy</code> groups the stream and potentially reshuffles data between TaskManagers so that all records with the same id land in, and are processed by, the same subtask.</p>
<p>Then we specify the window we’ll be working on (in our case a sliding window of 5 minutes that moves every minute)</p>
<p>Now <code>aggregate</code> executes continuously as stream data comes in. It’s a custom class that inherits from Flink’s default aggregation and adds the local priority queue logic on top of it. You’ll find the code for this in the repository for a more in-depth look.</p>
<p><code>LocalTopKAggregator</code> class has a method <code>getResult</code> which executes <strong>per window</strong></p>
<p>This is the method that generates the top K per window.</p>
<pre><code class="lang-java">    <span class="hljs-keyword">public</span> List&lt;Tuple2&lt;String, Integer&gt;&gt; getResult(Map&lt;String, Integer&gt; acc) {
        PriorityQueue&lt;Tuple2&lt;String, Integer&gt;&gt; pq = <span class="hljs-keyword">new</span> PriorityQueue&lt;&gt;(Comparator.comparingInt(t -&gt; t.f1));
        <span class="hljs-keyword">for</span> (Map.Entry&lt;String, Integer&gt; entry : acc.entrySet()) {
            pq.offer(Tuple2.of(entry.getKey(), entry.getValue()));
            <span class="hljs-keyword">if</span> (pq.size() &gt; k) pq.poll();
        }

        List&lt;Tuple2&lt;String, Integer&gt;&gt; result = <span class="hljs-keyword">new</span> ArrayList&lt;&gt;(pq);
        result.sort((a, b) -&gt; Integer.compare(b.f1, a.f1));
        <span class="hljs-keyword">return</span> result;
    }
</code></pre>
<p>Our output per subtask is now the local top K; next we need to group these somewhere to generate the final top K</p>
<pre><code class="lang-java">        likes
        .keyBy(value -&gt; value.f0)
        .window(SlidingProcessingTimeWindows.of(Time.minutes(<span class="hljs-number">5</span>), Time.minutes(<span class="hljs-number">1</span>)))
        .aggregate(<span class="hljs-keyword">new</span> LocalTopKAggregator(<span class="hljs-number">5</span>))

        .map(list -&gt; Tuple2.of(<span class="hljs-string">"global"</span>, list))
         <span class="hljs-comment">// Java erasure ***not business logic***   </span>
        .returns(<span class="hljs-keyword">new</span> TypeHint&lt;Tuple2&lt;String, List&lt;Tuple2&lt;String, Integer&gt;&gt;&gt;&gt;() {})
        .keyBy(t -&gt; t.f0)
        .window(SlidingProcessingTimeWindows.of(Time.minutes(<span class="hljs-number">5</span>), Time.minutes(<span class="hljs-number">1</span>)))
        .process(<span class="hljs-keyword">new</span> GlobalTopKMerge(<span class="hljs-number">5</span>));
</code></pre>
<p>Now we map over our generated local top K’s and give them all the same key, <code>global</code>. (The <code>TypeHint</code> in <code>returns</code> is Java-specific boilerplate that pins down the return type of the <code>map</code> above due to type erasure.)</p>
<p>We group by the key <strong>global</strong></p>
<p>So now we have the key <strong>global</strong> along with an array of local top k’s</p>
<p>We only need to process the latest local top k’s so we add another <code>window</code></p>
<p>If we don’t add this, we’ll keep feeding the global top k with all the windows (it would keep the history of previous windows instead of removing them)</p>
<p>Now <code>GlobalTopKMerge</code> merges all the top K’s respectively and then pushes the result to redis.</p>
<pre><code class="lang-java">    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">process</span><span class="hljs-params">(String key,
                        Context context,
                        Iterable&lt;Tuple2&lt;String, List&lt;Tuple2&lt;String, Integer&gt;&gt;&gt;&gt; elements,
                        Collector&lt;Void&gt; out)</span> </span>{

        PriorityQueue&lt;Tuple2&lt;String, Integer&gt;&gt; heap = <span class="hljs-keyword">new</span> PriorityQueue&lt;&gt;(Comparator.comparingInt(t -&gt; t.f1));

        <span class="hljs-keyword">for</span> (Tuple2&lt;String, List&lt;Tuple2&lt;String, Integer&gt;&gt;&gt; localTopK : elements) {
            <span class="hljs-keyword">for</span> (Tuple2&lt;String, Integer&gt; entry : localTopK.f1) {
                heap.offer(entry);
                <span class="hljs-keyword">if</span> (heap.size() &gt; k) {
                    heap.poll();
                }
            }
        }

        List&lt;Tuple2&lt;String, Integer&gt;&gt; finalTopK = <span class="hljs-keyword">new</span> ArrayList&lt;&gt;(heap);
        finalTopK.sort((a, b) -&gt; Integer.compare(b.f1, a.f1));

        pushToRedis(finalTopK);
    }
</code></pre>
<h2 id="heading-sink">Sink</h2>
<p>Finally, <code>pushToRedis</code> sets a simple key on Redis that the consumer reads from:</p>
<pre><code class="lang-java">    <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">pushToRedis</span><span class="hljs-params">(List&lt;Tuple2&lt;String, Integer&gt;&gt; topK)</span> </span>{
      <span class="hljs-keyword">try</span> {
        String redisKey = <span class="hljs-string">"trending:5min"</span>;
        String json = <span class="hljs-keyword">new</span> ObjectMapper().writeValueAsString(topK);
        jedis.set(redisKey, json);
      } <span class="hljs-keyword">catch</span> (Exception e) {
        System.out.println(<span class="hljs-string">"Error pushing to Redis: "</span> + e.getMessage());
      }
    }
</code></pre>
<h1 id="heading-summary">Summary</h1>
<p>Data has become an integral part of today’s modern world, and the insights it gives can be the difference in making millions, especially for startups. The best and most fun thing about system design is that there is never a one-size-fits-all: different approaches have different trade-offs, and the business is an integral part of knowing which direction to head. Scaling this project can be a separate article if you like. If you made it here, thank you, and I hope you learned something valuable today. See you all in the next one!</p>
<h1 id="heading-github">Github</h1>
<ul>
<li><a target="_blank" href="https://github.com/amrelhewy09/topK_kafka_flink.git">https://github.com/amrelhewy09/topK_kafka_flink.git</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Probabilistic Data Structures Part 1]]></title><description><![CDATA[Introduction
Today is all about probabilistic data structures, probabilistic data structures are data structures that use randomization and approximation to achieve efficient storage and processing of large-scale data. These structures typically trad...]]></description><link>https://hewi.blog/probabilistic-data-structures-part-1</link><guid isPermaLink="true">https://hewi.blog/probabilistic-data-structures-part-1</guid><category><![CDATA[data structures]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[scalability]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[Programming Tips]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[software development]]></category><category><![CDATA[AWS]]></category><category><![CDATA[AI]]></category><category><![CDATA[Problem Solving]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Fri, 21 Mar 2025 22:37:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1742596574090/2a8b34a5-6359-4de2-ac1e-14b5628aaf5b.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Today is all about probabilistic data structures, <strong>probabilistic data structures</strong> are data structures that use <strong>randomization</strong> and <strong>approximation</strong> to achieve <strong>efficient storage and processing</strong> of large-scale data. These structures typically trade off <strong>perfect accuracy</strong> for <strong>lower memory usage</strong> and <strong>faster performance</strong>, making them useful for handling big data, streaming data, and distributed systems.</p>
<p>Examples would be getting an approximate count across millions of records, estimating the cardinality of a set of data, or checking if something exists amongst millions of records. The aim is to sacrifice perfect accuracy for scalability, and that is of course only if the business is okay with something like this.</p>
<p>I’ll be going through the ones I’ve used and some more too. They are as follows:</p>
<ol>
<li><p>Count-Min Sketch ✅</p>
</li>
<li><p>Bloom Filter ✅</p>
</li>
<li><p>HyperLogLog ✅</p>
</li>
<li><p>Skip Lists ✅</p>
</li>
<li><p>K-Minimum Values</p>
</li>
<li><p>LogLog &amp; SuperLogLog</p>
</li>
<li><p>Top K &amp; Heavy Hitters</p>
</li>
</ol>
<h1 id="heading-count-min-sketch">Count-Min Sketch</h1>
<p>Let’s say we are receiving a continuous stream of data <code>{"video_id": 1}</code> which are view counts, and we’re required to sum them up. The most straightforward way of doing this is by using a HashMap where we’d have the <code>video_id</code> as a key and the value being the total count.</p>
<p>The problem here is that the videos are a lot and the hash map size would increase significantly potentially being memory inefficient. If the business requirement was not strict on showing the exact view counts then count min sketch is the way to go.</p>
<p><strong>Count-Min Sketch</strong> is a probabilistic data structure that provides an <strong>approximate frequency count</strong> of elements in a data stream using <strong>constant space</strong>.</p>
<p>Instead of storing each video’s view count in a <strong>HashMap</strong>, which grows as the number of videos increases, we use a <strong>2D array (matrix)</strong> of counters with multiple hash functions to track approximate counts.</p>
<p><strong>Step by step breakdown:</strong></p>
<ol>
<li><p><strong>Initialize a matrix (w × d)</strong></p>
<ul>
<li><p><code>w</code> columns (width): Represents the number of counters in each row.</p>
</li>
<li><p><code>d</code> rows (depth): Each row corresponds to a different hash function.</p>
</li>
<li><p>All counters start at <code>0</code>.</p>
</li>
</ul>
</li>
<li><p><strong>Updating the Count</strong></p>
<ul>
<li><p>When a new view event <code>{ "video_id": 1 }</code> arrives:</p>
<ul>
<li><p>The <strong>video ID</strong> is hashed using <code>d</code> different hash functions.</p>
</li>
<li><p>Each hash function maps the video ID to a column in its respective row.</p>
</li>
<li><p>The counters at those positions are incremented.</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Querying the Count</strong></p>
<ul>
<li><p>To estimate the view count of a <strong>specific video ID</strong>, hash it with the same <code>d</code> hash functions and mod it with the column length. The more columns you add the more accuracy but also more space.</p>
</li>
<li><p>Look at the corresponding positions in each row and take the <strong>minimum</strong> value across all rows.</p>
</li>
<li><p>This minimizes the effect of hash collisions (hence "count-min").</p>
</li>
</ul>
</li>
</ol>
<p>The reason we pick the minimum value is to minimize the effect of hash collisions, where different video ids might hit the same row &amp; column. A collision can only inflate a counter, never decrease it, so the minimum across rows is the least-overestimated count. Sometimes we can also use 5 hash functions instead of 3, which would increase the accuracy. But increasing the number of hash functions isn’t always beneficial, as there is a point where adding more won’t offer any significant improvement in accuracy. Below is a gif example of how it works, followed by a small sketch of the structure in code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742426637625/df800f12-0d98-42bf-8fdd-3f3c72d9b1fa.gif" alt class="image--center mx-auto" /></p>
<h1 id="heading-bloom-filters">Bloom Filters</h1>
<p>Another cool probabilistic data structure now: a Bloom filter is a space-efficient data structure that tells you if an element is 100% not in a set of elements, or if it <strong>maybe</strong> is.</p>
<p>Meaning it may give false positives but never a false negative, so an element either definitely doesn’t exist or maybe exists.</p>
<p>Similar to the idea of Count-Min Sketch, a value gets passed through multiple hash functions and the results map to different indexes in a <strong>fixed-size bit array, and the bits at these indexes are set to 1.</strong></p>
<p>So if we’re looking for a value, we pass it through the same hash functions and check if the bits at the respective indices are 1 or not. If there are any zeroes then it’s 100% not in the set of elements, but if they’re all ones then it may be in the set, or may have just collided with other members of the set.</p>
<p>Here’s how they work</p>
<p><img src="https://s8.ezgif.com/tmp/ezgif-880eeb815c8b60.gif" alt="[animate output image]" /></p>
<h1 id="heading-skip-lists">Skip Lists</h1>
<p>Skip lists are really efficient data structures that help optimize searching, insertion and deletion in linked lists. In sorted linked lists finding elements will always take O(n) time because you have to traverse to reach the required node. However a way to optimize that is by using skip lists.</p>
<p>They are simply levels stacked on top of a basic linked list (the levels are called express lanes, i.e. faster routes), something like this</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742595851014/2120f65e-4b9a-4172-afec-a586c66f9485.png" alt class="image--center mx-auto" /></p>
<p>And when traversing we start from the top left 1 and check the next station in the express lane. So for example if we’re looking for the value 5</p>
<ol>
<li><p>Start at the top-left value and check its next station in the express lane, 4</p>
</li>
<li><p>since 5 &gt; 4 we go through the express lane</p>
</li>
<li><p>we check the next express lane station 6</p>
</li>
<li><p>since 6 &gt; 5 we drop down and traverse the basic linked list till we find 5</p>
</li>
</ol>
<p>This can go on for multiple levels, e.g. having a 1 → 5 layer above the one in the image above. This helps skip a lot of nodes, which effectively optimizes search, insertion and deletion time.</p>
<p>Here’s an example if we’re looking for the value 5 in the skip list.</p>
<p><img src="https://s8.ezgif.com/tmp/ezgif-86c31f6f190f4d.gif" alt="[animate output image]" /></p>
<h1 id="heading-hyperloglog">HyperLogLog</h1>
<p>This one is all about cardinality: it gives you an estimate of how many unique items exist in a set of elements, and it is memory-efficient.</p>
<p>Think of a <strong>lottery</strong> where you randomly pick a number between 1 and 100.</p>
<ul>
<li><p>If you only draw <strong>5 numbers</strong>, you probably won’t get anything close to 100.</p>
</li>
<li><p>If you draw <strong>a million numbers</strong>, chances are, you'll eventually get 100.</p>
</li>
</ul>
<p>🔸 The more numbers you pick, the higher the chance of getting a <strong>big</strong> number.</p>
<p>HyperLogLog works the same way, but instead of picking numbers, it’s looking at <strong>how many leading zeroes</strong> appear in a hash.</p>
<ol>
<li><p>Each and every element gets hashed into a random <strong>binary hash (a binary representation of a hashed value)</strong></p>
</li>
<li><p>Once we start hashing elements, we look at <strong>where the first</strong> <code>1</code> appears in the binary hash and remember the <strong>biggest number we've seen so far</strong>.</p>
</li>
</ol>
<p><strong>The more unique values we hash, the higher the chance that one of them has many leading zeroes</strong>.</p>
<p><strong>If you’ve seen a hash with a</strong> <code>1</code> at position 7, that means you must have seen a lot of unique elements to get such a rare case.</p>
<p>HyperLogLog <strong>tracks the largest position seen</strong> and uses math to estimate how many unique elements must exist for that to happen.</p>
<p>Now that items are hashed, they are split into different buckets. Buckets help smooth out extreme cases where, for example, we have 3 items but by pure luck a leftmost 1 was found at the 7th position in one of them. Let me explain.</p>
<p>Buckets are <strong>small memory slots</strong> where HyperLogLog stores information <strong>separately for different groups of elements</strong>.</p>
<p>Imagine you're counting <strong>unique people</strong> entering a stadium.</p>
<ul>
<li><p>If you only look at <strong>one entrance</strong>, you might get a <strong>bad estimate</strong> because not all people use that entrance.</p>
</li>
<li><p>Instead, you <strong>track multiple entrances separately</strong> and combine the results to get a <strong>better</strong> estimate.</p>
</li>
</ul>
<p>Each bucket <strong>only remembers the biggest value</strong> it has seen.</p>
<p>Now, instead of estimating based on <strong>one extreme value</strong>, we <strong>take an average of all the buckets</strong>.</p>
<p>For example:</p>
<ul>
<li><p>If all buckets saw <strong>only small values (1, 1, 2)</strong>, we probably have <strong>few unique elements</strong>.</p>
</li>
<li><p>If some buckets saw <strong>high values (like 5, 6, 7)</strong>, that suggests we have <strong>a lot of unique elements</strong>.</p>
</li>
</ul>
<p>It’s kind of confusing, but the more buckets you use, the better accuracy you get</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1742594966180/a9e803c2-5fd9-43ff-9eb1-365003b5c386.png" alt class="image--center mx-auto" /></p>
<p>Then we proceed to take the average of all the buckets (specifically a harmonic mean) to estimate the number of unique elements using a mathematical formula. Here’s a small sketch of the whole idea in code.</p>
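<p>This is a rough illustration only (a real HyperLogLog adds small- and large-range corrections), assuming 64 buckets; the constant 0.709 is the standard bias-correction value for that bucket count.</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "hash/fnv"
    "math"
    "math/bits"
)

const p = 6      // the first p bits of the hash pick a bucket
const m = 1 &lt;&lt; p // 64 buckets

// add routes the item to a bucket and keeps the largest
// "position of the first 1" seen in the remaining bits.
func add(buckets []uint8, item string) {
    h := fnv.New64a()
    h.Write([]byte(item))
    x := h.Sum64()
    bucket := x &gt;&gt; (64 - p)
    rank := uint8(bits.LeadingZeros64(x&lt;&lt;p|1)) + 1
    if rank &gt; buckets[bucket] {
        buckets[bucket] = rank
    }
}

// estimate combines the buckets with a harmonic mean.
func estimate(buckets []uint8) float64 {
    sum := 0.0
    for _, r := range buckets {
        sum += math.Pow(2, -float64(r))
    }
    alpha := 0.709 // bias correction for m = 64
    return alpha * m * m / sum
}

func main() {
    buckets := make([]uint8, m)
    for i := 0; i &lt; 10000; i++ {
        add(buckets, fmt.Sprintf("user-%d", i))
    }
    fmt.Printf("estimated cardinality: %.0f\n", estimate(buckets))
}
</code></pre>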
]]></content:encoded></item><item><title><![CDATA[Scaling Rails: Understanding Puma Workers, Threads, and Database Connection Pooling]]></title><description><![CDATA[Introduction
Hello folks! in this article I’m going to be going through all the needed calculations to properly tune your rails app in production. This article aims to provide a schematic for calculating the threads needed both application wise and d...]]></description><link>https://hewi.blog/scaling-rails-understanding-puma-workers-threads-and-database-connection-pooling</link><guid isPermaLink="true">https://hewi.blog/scaling-rails-understanding-puma-workers-threads-and-database-connection-pooling</guid><category><![CDATA[Rails]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[scalability]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Mon, 20 Jan 2025 12:07:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1737374722551/ae6d267f-f6e7-4f9b-9655-f50bbc82f4a4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hello folks! In this article I’m going to go through all the calculations needed to properly tune your Rails app in production. This article aims to provide a schematic for calculating the threads needed, both application-wise and database-wise, so no request or background job fails to acquire a database connection when things get heated. Let’s start.</p>
<h1 id="heading-puma-and-its-workers">Puma and its workers</h1>
<p>First of all, let’s start off by talking about Puma, the default Rails web server, and how it operates generally.</p>
<p>When booting up Puma, it has configurations for the number of <code>workers</code> and the <code>number of threads per worker</code>. But what are both?</p>
<p>When setting the <code>worker</code> variable to 2, for example, Puma will fork its operating system process however many times you set <code>worker</code> (in our case 2). This means you will have <code>workers</code>-many instances of your Rails code ready to serve HTTP requests.</p>
<p>In each puma worker there will be multiple threads based on the <code>threads</code> configuration. However due to the GIL lock in ruby, only one thread can be executed at a moment of time <strong>unless this thread is doing some blocking operation (I/O) then the GIL lock is released and other threads can run safely.</strong></p>
<p>So far that means if we have an instance with 2 workers and 2 threads per worker we will have the following number of threads:</p>
<blockquote>
<p>Total number of threads = worker count × threads per worker = 2 × 2 = 4</p>
</blockquote>
<p>That means for each thread we need to reserve a potential <strong>connection from the database connection pool, because every application thread might connect to the database at some point depending on the load.</strong></p>
<p>Rails maintains its own <strong>database connection pool</strong>, with a new pool created for each worker process. Threads within a worker operate on the same pool. If a Puma worker utilizes 5 threads, then database.yml must be configured with a connection pool of 5, since each thread could possibly establish a database connection.</p>
<p>Since each Worker is spawned by a system fork(), the new worker will have its own set of 5 threads to work with, and thus for the new Rails instance created, the database.yml will still be set to a connection pool of 5.</p>
<h1 id="heading-rails-database-connection-pool-vs-actual-database-pool">Rails Database connection pool VS Actual Database Pool</h1>
<p>It’s important before moving forward to be able to differentiate between these two, as they might confuse a lot of people. Each Rails app maintains its own database connection pool. This is nothing but a pool of connections reserved for the database: when needed, a connection gets picked from the pool, does its job, and goes back to be reused later on. Its size is by default set by <code>&lt;%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %&gt;</code>, which matches the max threads in a single Puma worker. Meaning for every instance of the Ruby code invoked, we will have a DB pool of the size specified in this env var.</p>
<p>The actual database pool, for example the PostgreSQL connection pool (or the connections PostgreSQL accepts), defines how many concurrent connections the PostgreSQL server can handle across all clients. It is controlled in <code>postgresql.conf</code> by the <code>max_connections</code> parameter, which we can alter accordingly.</p>
<p>This will come in handy later on so make sure you understand the difference before proceeding.</p>
<h1 id="heading-sidekiq">Sidekiq</h1>
<p>Now let’s imagine this scenario.</p>
<ol>
<li><p>Postgres <code>max_connections</code> is set to 100</p>
</li>
<li><p>We have a rails app operated by Puma web server and the config is as follows:</p>
<ol>
<li><p><code>workers</code> is set to 5</p>
</li>
<li><p><code>max_threads</code> is set to 20</p>
</li>
</ol>
</li>
</ol>
<p>Knowing this information, we have 100 <code>max_connections</code> on the Postgres side and <code>5×20=100</code> threads potentially connecting to Postgres from the application side. <strong>They are the same count</strong></p>
<p>Now comes in a background job process such as Sidekiq. Sidekiq has its own completely separate configuration regarding concurrency, where <code>concurrency</code> is the number of threads Sidekiq operates on.</p>
<p>Threads in Ruby operate under fundamentally different paradigms, largely due to the <strong>Global Interpreter Lock (GIL)</strong> in MRI Ruby. In Rails (running under a server like Puma), each thread is responsible for handling <strong>one request at a time</strong>. There is no true concurrency for <strong>CPU-bound</strong> tasks because of the <strong>GIL</strong> in MRI Ruby. Threads in Rails can perform concurrent operations when waiting for <strong>I/O-bound tasks</strong> (e.g., database queries, external API calls). During such waiting periods, other threads can process requests. A thread finishes one request entirely before moving on to the next.</p>
<p>In Sidekiq the GIL lock is still there but the nature of <code>job processing</code> makes it feel more concurrent. Because job processing might have a lot more I/O (db access, external API calls) it has a lot more context switching between threads than the web server (GIL gets released).</p>
<p>When tuning the postgres <code>max_connections</code> we need to make sure that we take sidekiq into consideration.</p>
<p>The equation becomes as follows:</p>
<blockquote>
<p>pg_max_connections = (Puma worker count * RAILS_MAX THREADS) + (Sidekiq concurrency * Sidekiq process count)</p>
</blockquote>
<p>Meaning if we have 1 Sidekiq process with a concurrency of <code>5</code>, we need to increase <code>max_connections</code> from 100 to 105 so each thread can potentially grab a connection to the database.</p>
<h1 id="heading-summary">Summary</h1>
<p>When stress testing and scaling a Rails application, it’s crucial to understand how all components work together seamlessly. A lack of database connections leading to 500 errors is one of the worst experiences a user can face, and addressing this should be a top priority. By gaining deeper insights into these mechanisms, you can ensure your application is resilient, scalable, and user-friendly. I hope this article has helped clarify these concepts, and I look forward to sharing more in the next one!</p>
<p>Also subscribe to the YouTube channel; I’ll be doing more than Leetcode over there soon 😄<a target="_blank" href="https://www.youtube.com/@techstuffrandom">https://www.youtube.com/@techstuffrandom</a></p>
]]></content:encoded></item><item><title><![CDATA[How does Postgres persist to disk? What is WAL all about?]]></title><description><![CDATA[Introduction
Hello folks! in this quick article i’m going to be talking about how a database like Postgres actually persists to disk and what happens behind the scenes. What is WAL all about and what does it even stand for? Let’s dive in.
When it com...]]></description><link>https://hewi.blog/how-does-postgres-persist-to-disk-what-is-wal-all-about</link><guid isPermaLink="true">https://hewi.blog/how-does-postgres-persist-to-disk-what-is-wal-all-about</guid><category><![CDATA[PostgreSQL]]></category><category><![CDATA[software development]]></category><category><![CDATA[backend]]></category><category><![CDATA[engineering]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 11 Jan 2025 10:53:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1736592806598/88779235-b130-44e7-8f88-d4e0605a6c91.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hello folks! In this quick article I’m going to be talking about how a database like Postgres actually persists to disk and what happens behind the scenes. What is WAL all about, and what does it even stand for? Let’s dive in.</p>
<p>When it comes to I/O operations (in our case writing to disk). Optimizations become a must because I/O operations are very expensive especially writing to disk due to a lot of factors. Postgres tackles this problem in a clever manner.</p>
<p>When a write request is made to the database, You’d think that it would persist it to disk right away, but that’s not what actually happens behind the scenes. Introducing <strong>Postgres Shared Buffers</strong></p>
<h1 id="heading-shared-buffers">Shared Buffers</h1>
<p>When a <strong>write</strong> is made, it isn’t instantly flushed to disk; Postgres actually loads the page impacted by the write into memory and adjusts it there.</p>
<blockquote>
<p>Postgres saves rows in pages, and whenever we update a certain row we find the page it’s in, load it into memory, and update the page with the new record. Same for writes, whether it’s a new page or an existing one that’s not yet full.</p>
</blockquote>
<p>This is called a <code>dirty page</code>: it needs to be written back to disk later. Postgres relies on <strong>background processes</strong> like the <strong>background writer</strong> or <strong>checkpoints</strong> to write dirty pages from shared buffers to disk asynchronously. (We’ll get into that)</p>
<p>But the thing is, this approach hugely minimizes I/O by batching the flush operation to disk instead of having it happen for each and every write.</p>
<p>On the other hand, when a read request is made Postgres first checks the shared buffers for the pages being requested (whether dirty or not). If found it then proceeds to serve it from there, if not it loads the page from disk and adds it to the shared buffers cache. If the shared buffer is full a <strong>victim dirty page</strong> will be written to disk and the newly read page will be replaced by it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736590638514/5ecc5857-5281-41eb-b582-9969f30b1188.png" alt class="image--center mx-auto" /></p>
<p>Now the question that arises is: what happens if you lose power halfway through writing to the data files? Let’s say some write was made to memory, and during flushing it to disk a power cut happened, leaving that data inconsistent. The client still thinks they made that write when in reality it never fully persisted to disk. This can cause a lot of data corruption and integrity loss. Hence, introducing WAL.</p>
<h1 id="heading-wal-write-ahead-log">WAL (Write Ahead Log)</h1>
<p>WAL or Write Ahead Log is a mechanism to ensure the consistency and safety of data. It is a technique where every change to the database is logged <strong>before</strong> it is applied to the actual data files. This log acts as a journal that records all modifications. It is an append only log that logs everything a user writes to the database while simultaneously updating the data pages in shared buffer as mentioned before. Any newly written data can remain in the shared buffer as long as we have a log that tracked the change. In case anything goes wrong we can reconstruct the state from the log.</p>
<p>The WAL is much smaller than the actual data files, and so we can flush them relatively faster. They are also sequential unlike data pages which are random. Disks love sequential writes, which is another benefit of WAL.</p>
<p>Every WAL entry is first written to a <code>WAL buffer</code> in memory. Then, when a certain trigger fires, this buffer gets flushed to disk. The trigger can either be the buffer reaching a certain size or a certain period passing; these are all configurable on Postgres’s side. Once flushed to disk, these entries can be considered <strong>committed.</strong></p>
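<p>To make the ordering concrete, here’s a toy write path in Go (an illustration of the WAL rule, not Postgres internals): the change is appended to the log and fsynced before the in-memory page is touched. A real database batches these syncs rather than paying one per write.</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "os"
)

// db is a toy: the WAL file plus in-memory "pages" (our shared buffers).
type db struct {
    wal   *os.File
    pages map[string]string
}

// write logs first, makes the record durable, and only then
// mutates the in-memory page, which is flushed to data files later.
func (d *db) write(key, val string) error {
    record := fmt.Sprintf("SET %s=%s\n", key, val)
    if _, err := d.wal.WriteString(record); err != nil {
        return err
    }
    if err := d.wal.Sync(); err != nil { // fsync: the change is now recoverable
        return err
    }
    d.pages[key] = val // the page is now dirty
    return nil
}

func main() {
    f, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    d := &amp;db{wal: f, pages: map[string]string{}}
    if err := d.write("user:1", "hewi"); err != nil {
        panic(err)
    }
    fmt.Println("committed:", d.pages["user:1"])
}
</code></pre>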
<p>The database can crash after writing WAL entries (before flushing shared buffer to disk), that is fine, as long we know the transaction state belonging to each WAL entry we can discard or omit uncommitted WAL entries upon recovery (to ensure data consistency).</p>
<p>For example, if you are in the middle of a transaction and the database crashed, we consider the transaction rolled back by default. I will do another article explaining how WAL actually writes transactions.</p>
<p>When all the data files have been flushed and updated to reflect the information in the WAL, this is called <strong>checkpointing.</strong> Once this happens, a <strong>checkpoint record</strong> is written to the Write-Ahead Log (WAL), marking the point up to which all changes have been flushed to disk.</p>
<p>In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the WAL (known as the redo record) from which it should start the REDO operation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1736592687063/3c59cbe4-346b-42f4-b75b-b9dfd5036037.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-references">References</h1>
<ol>
<li><p><a target="_blank" href="https://x.com/hnasr/status/1867253354662920379">https://x.com/hnasr/status/1867253354662920379</a></p>
</li>
<li><p><a target="_blank" href="https://www.postgresql.org/docs/current/wal-configuration.html">https://www.postgresql.org/docs</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[UUID, ULID, NanoIDs and Snowflake IDs , What's the difference?]]></title><description><![CDATA[Hello everyone and welcome to the first article of 2025! In this article we’re going to be talking all about unique id generators. Different schemes that generate unique ids for a specific scope whether global or custom. Let’s start by discussing eac...]]></description><link>https://hewi.blog/uuid-ulid-nanoids-and-snowflake-ids-whats-the-difference</link><guid isPermaLink="true">https://hewi.blog/uuid-ulid-nanoids-and-snowflake-ids-whats-the-difference</guid><category><![CDATA[software development]]></category><category><![CDATA[backend]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Wed, 01 Jan 2025 15:14:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735744418527/11bbf511-9dc4-4748-a38e-046a3205cbba.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello everyone and welcome to the first article of 2025! In this article we’re going to be talking all about unique id generators. Different schemes that generate unique ids for a specific scope whether global or custom. Let’s start by discussing each one separately and having a pro con list at the end that might guide you to the answer of which one do I use? Let’s start</p>
<h1 id="heading-uuids">UUIDs</h1>
<p>UUID stands for universally unique identifier. UUIDs allow generating ids in a way that guarantees uniqueness without knowledge of other systems. They have many versions, some of which are not used much anymore; let’s discuss each version</p>
<h2 id="heading-uuidv1">UUIDv1</h2>
<p>A UUID version 1 is known as a time-based UUID and can be broken down as follows:</p>
<p><code>UUIDv1: [time_low]-[time_mid]-[time_high_and_version]-[clock_seq_and_reserved]-[node]</code></p>
<p>UUIDv1 uses a <strong>60-bit timestamp</strong> that represents the number of <strong>100-nanosecond intervals</strong> elapsed since the Gregorian epoch: <strong>15 October 1582 00:00:00 UTC</strong>. This timestamp is used to ensure that UUIDs generated at different times are unique. e.g <strong>13743895347200 represents the number of 100 nanosecond intervals passed since epoch.</strong> Using the nanosecond representation allows for more uniqueness and finer granularity. Using <strong>100-nanosecond intervals</strong> allows for up to <strong>10 million unique time intervals per second</strong>.</p>
<p>The total timestamp is 60 bits:</p>
<ul>
<li><p>Most significant bits represent older times.</p>
</li>
<li><p>Least significant bits represent the finer granularity within the most recent time interval.</p>
</li>
</ul>
<p><strong><em>Timestamp and its 60-Bit Division in UUIDv1:</em></strong></p>
<p>The 60-bit timestamp is divided across three fields in UUIDv1:</p>
<ol>
<li><p><strong>Time Low</strong>: Stores the <strong>least significant 32 bits</strong> of the timestamp.</p>
</li>
<li><p><strong>Time Mid</strong>: Stores the <strong>next 16 bits</strong>.</p>
</li>
<li><p><strong>Time High</strong>: Stores the <strong>most significant 12 bits</strong>.</p>
</li>
</ol>
<p>These bits are arranged in a specific format to construct the UUID.</p>
<p><strong>Node</strong> is a 48-bit value, often derived from the MAC address of the machine generating the UUID. If the MAC address is unavailable, a random value is used instead.</p>
<p>The <strong>Clock sequence</strong> in <strong>UUID Version 1 (UUIDv1)</strong> is a 14-bit value that ensures uniqueness when generating UUIDs in situations where the system clock may not be reliable.</p>
<p>It helps maintain uniqueness when:</p>
<ul>
<li><p>The system clock is adjusted backward.</p>
</li>
<li><p>The system clock cannot guarantee monotonicity (e.g., due to manual adjustments or hardware issues).</p>
</li>
</ul>
<h2 id="heading-uuidv2">UUIDv2</h2>
<p>In V2 the <code>low_time</code> segment of the structure was replaced with a POSIX user ID. The theory was that these UUIDs could be traced back to the user account that generated them. Since the <code>low_time</code> segment is where much of the variability of UUIDs reside, replacing this segment increases the chance of collision. As a result, this version of the UUID is rarely used.</p>
<h2 id="heading-uuidv3-amp-uuid-v5">UUIDv3 &amp; UUID v5</h2>
<p>Versions 3 and 5 of UUIDs are very similar. The goal of these versions is to allow UUIDs to be generated in a deterministic way so that, given the same information, the same UUID can be generated. These implementations use two pieces of information: a namespace (which itself is a UUID) and a name. These values are run through a hashing algorithm to generate a 128-bit value that can be represented as a UUID.</p>
<p>The key difference between these versions is that version 3 uses an MD5 hashing algorithm, and version 5 uses SHA1.</p>
<h2 id="heading-uuidv4">UUIDv4</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735738242100/0077c61a-75d8-481c-9b4a-7d1054ed1007.png" alt class="image--center mx-auto" /></p>
<p>Version 4 is known as the random variant because, as the name implies, the value of the UUID is almost entirely random. The exception to this is the first position in the third segment of the UUID, which will always be <code>4</code> to signify the <strong>version</strong> used.</p>
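<p>This version is simple enough to sketch by hand: take 16 cryptographically random bytes, then force the version and variant bits per RFC 4122. A minimal Go sketch:</p>
<pre><code class="lang-go">package main

import (
    "crypto/rand"
    "fmt"
)

// uuidv4 builds a random UUID: 122 random bits plus fixed
// version (4) and variant (10xx) bits.
func uuidv4() (string, error) {
    b := make([]byte, 16)
    if _, err := rand.Read(b); err != nil {
        return "", err
    }
    b[6] = (b[6] &amp; 0x0f) | 0x40 // version 4 in the high nibble of byte 6
    b[8] = (b[8] &amp; 0x3f) | 0x80 // variant bits 10xx in byte 8
    return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
    id, err := uuidv4()
    if err != nil {
        panic(err)
    }
    fmt.Println(id) // the third group always starts with 4
}
</code></pre>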
<h2 id="heading-uuidv6">UUIDv6</h2>
<p>Version 6 is nearly identical to Version 1. The only difference is that the bits used to capture the timestamp are <strong>flipped</strong>, meaning the most significant portions of the timestamp are stored first.</p>
<p><code>[time_high_and_version]-[time_mid]-[time_low]-[clock_seq_and_reserved]-[node]</code></p>
<p>The main reason for this is that UUIDv1 had problems when it came to two types of sorting:</p>
<p><strong>Lexicographical order</strong> is similar to how words are arranged in a <strong>dictionary</strong> or <strong>alphabetical</strong> order. It's based on the <strong>lexicographic (dictionary) comparison</strong> of characters or symbols in a string, from left to right.</p>
<p><strong>Chronological order</strong> refers to sorting based on <strong>time</strong> — from the earliest to the latest (or vice versa). This is the kind of order you'd expect when sorting timestamps, dates, or events. In chronological order, items are compared based on their <strong>relative timing</strong>.</p>
<p>UUIDV1 is designed to be unique identifiers that can include a <strong>timestamp</strong>. However, its structure can cause confusion because <strong>lexicographical sorting</strong> does not always align with <strong>chronological sorting</strong>, due to how fields are ordered. e.g</p>
<pre><code class="lang-go">UUID1: a1b2c3d4-e5f6<span class="hljs-number">-11</span>ec<span class="hljs-number">-9</span>abc<span class="hljs-number">-123456789</span>abc
UUID2: a1b2c3d5-e5f6<span class="hljs-number">-11</span>ec<span class="hljs-number">-9</span>abc<span class="hljs-number">-123456789</span>abc
</code></pre>
<ul>
<li><p><strong>Chronological order</strong> would expect <code>UUID1</code> to come before <code>UUID2</code> because <code>UUID1</code> represents an earlier time.</p>
</li>
<li><p><strong>Lexicographical order</strong>, however, would first compare <code>time_low</code> (the first field <code>a1b2c3d4</code>), and since <code>a1b2c3d4</code> is less than <code>a1b2c3d5</code>, it might place <code>UUID1</code> before <code>UUID2</code> — <strong>this happens correctly</strong>, but in some cases, this alignment doesn’t hold true.</p>
</li>
<li><p>If the <strong>time_high_and_version</strong> field were placed at the start, then chronological sorting would naturally match lexicographical sorting, as in <strong>UUIDv6</strong>.</p>
</li>
</ul>
<h2 id="heading-uuidv7">UUIDv7</h2>
<p>Version 7 is also a time-based UUID variant, but it integrates the more commonly used Unix Epoch timestamp instead of the Gregorian calendar date used by Version 1. The other key difference is that the node (the value based on the system generating the UUID) is replaced with randomness, making these UUIDs less trackable back to their source.</p>
<h2 id="heading-use-cases-for-uuids"><strong>Use Cases for UUIDs</strong></h2>
<ul>
<li><p>Session Identifiers</p>
</li>
<li><p>File Storage and Versioning</p>
</li>
<li><p>API Tokens and Authentication</p>
</li>
<li><p>E-commerce and Order Tracking</p>
</li>
<li><p>Event Tracking and Logs</p>
</li>
</ul>
<h1 id="heading-ulids">ULIDs</h1>
<p>A ULID is a <strong>128-bit</strong> identifier, represented as a <strong>26-character string</strong> encoded in <strong>Base32</strong> (with a specific alphabet). It has two main components:</p>
<ul>
<li><p><strong>Timestamp</strong> (48 bits or 6 bytes) it is the first component of a ULID and is stored in <strong>milliseconds</strong> since the <strong>UNIX epoch</strong> (1970-01-01 00:00:00 UTC). The timestamp is packed into the first 48 bits of the ULID, which allows it to be <strong>lexicographically sortable</strong>. This means ULIDs generated in <strong>chronological order</strong> will <strong>sort correctly</strong> without needing special sorting logic.</p>
</li>
<li><p><strong>Randomness</strong> (80 bits or 10 bytes) to ensure that ULIDs are <strong>globally unique</strong>, the second part (after the timestamp) is made up of <strong>80 random bits</strong>. This random component ensures that even if two ULIDs are generated at the exact same millisecond, they will still be distinct.</p>
</li>
</ul>
<h4 id="heading-example">Example:</h4>
<ul>
<li><p>Base32 alphabet: <strong>A-Z, 2-7</strong></p>
</li>
<li><p>ULID: <code>01FZQZ4E0AMK4NK9F7J8N9DAX8</code></p>
</li>
</ul>
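<p>Here’s a minimal sketch in Go of how those two components are packed and rendered. It assumes the Crockford Base32 alphabet (no I, L, O or U); since 26 characters hold 130 bits, the first two encoded bits are zero padding.</p>
<pre><code class="lang-go">package main

import (
    "crypto/rand"
    "fmt"
    "time"
)

// Crockford Base32 alphabet used by ULID.
const alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// encode renders 128 bits as 26 five-bit characters,
// most significant bits first (2 zero pad bits up front).
func encode(b [16]byte) string {
    out := make([]byte, 26)
    for i := 0; i &lt; 26; i++ {
        v := 0
        for j := 0; j &lt; 5; j++ {
            pos := i*5 + j - 2 // bit index into the 128-bit payload
            if pos &lt; 0 {
                continue // the two leading pad bits are zero
            }
            v = v&lt;&lt;1 | int(b[pos/8]&gt;&gt;(7-pos%8)&amp;1)
        }
        out[i] = alphabet[v]
    }
    return string(out)
}

func ulid() (string, error) {
    var b [16]byte
    ms := uint64(time.Now().UnixMilli())
    for i := 5; i &gt;= 0; i-- { // 48-bit big-endian timestamp
        b[i] = byte(ms &amp; 0xff)
        ms &gt;&gt;= 8
    }
    if _, err := rand.Read(b[6:]); err != nil { // 80 random bits
        return "", err
    }
    return encode(b), nil
}

func main() {
    id, err := ulid()
    if err != nil {
        panic(err)
    }
    fmt.Println(id, len(id)) // 26 characters, sortable by time
}
</code></pre>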
<h3 id="heading-use-cases-for-ulids"><strong>Use Cases for ULIDs</strong></h3>
<ul>
<li><p><strong>Distributed systems</strong>: When you need globally unique IDs that are generated independently and can be sorted chronologically.</p>
</li>
<li><p><strong>Database indexing</strong>: ULIDs can be used as primary keys because they are lexicographically sortable, reducing fragmentation and improving performance.</p>
</li>
<li><p><strong>Caching</strong>: When creating time-sensitive keys or identifiers that need to be unique and sortable.</p>
</li>
<li><p>Also all UUID use cases can work here too.</p>
</li>
</ul>
<h3 id="heading-uuid-vs-ulid"><strong>UUID vs ULID</strong></h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>ULID</td><td>UUID</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Length</strong></td><td>26 characters</td><td>36 characters (Hexadecimal)</td></tr>
<tr>
<td><strong>Encoding</strong></td><td>Base32</td><td>Hexadecimal (Base 16)</td></tr>
<tr>
<td><strong>Timestamp</strong></td><td>48 bits (millisecond precision)</td><td>60 bits (100-nanosecond precision)</td></tr>
<tr>
<td><strong>Sortable</strong></td><td>Yes, lexicographically sortable by time</td><td>No (UUIDv6 is partially sortable)</td></tr>
<tr>
<td><strong>Randomness</strong></td><td>80 bits</td><td>80 bits (in UUIDv4)</td></tr>
<tr>
<td><strong>Use Case</strong></td><td>Ideal for distributed systems and databases</td><td>General purpose (e.g., unique identifiers)</td></tr>
</tbody>
</table>
</div><h1 id="heading-nanoids">NanoIDs</h1>
<p>Nanoid is a small, <strong>secure</strong>, and <strong>URL-friendly</strong> <strong>unique identifier</strong> generator, typically producing a <strong>fixed-length</strong> string of random characters. It is much smaller in size compared to UUIDs and ULIDs, making it more efficient in contexts where shorter identifiers are needed. Nanoids do not contain timestamps and are completely random.</p>
<h2 id="heading-key-characteristics-of-nano-ids"><strong>Key Characteristics of Nano IDs</strong>:</h2>
<ul>
<li><p><strong>Length</strong>: A NanoID is around <strong>21 characters</strong> long, but the length can be customized.</p>
</li>
<li><p><strong>Alphabet</strong>: It uses a custom alphabet that avoids characters that might be confusing or problematic in URLs (like <code>/</code>, <code>+</code>, and <code>=</code>).</p>
</li>
</ul>
<p>Nanoids are typically generated from a <strong>cryptographically secure random source</strong>, and the characters in the resulting identifier are drawn from a custom alphabet.</p>
<p><strong>Cryptographically secure random numbers</strong> (CSPRNGs) alone do not guarantee <strong>uniqueness</strong>. They ensure that the values generated are <strong>unpredictable</strong> and <strong>hard to reproduce</strong>, but they do not inherently ensure that each generated value is <strong>unique</strong> across all possible values.</p>
<p>The default alphabet used by Nanoid consists of the following <strong>64 characters</strong> (A-Z, a-z, 0-9, plus <code>_</code> and <code>-</code>):</p>
<p><code>abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-</code></p>
<p>e.g <code>V1StGXR8ZtM08l5c5yLyzLq</code></p>
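<p>A minimal Nanoid-style generator sketched in Go: since 256 is an exact multiple of 64, mapping each random byte modulo 64 onto the alphabet introduces no bias (alphabets of other sizes would need rejection sampling).</p>
<pre><code class="lang-go">package main

import (
    "crypto/rand"
    "fmt"
)

// 64 URL-safe characters; 256 % 64 == 0, so byte-mod-64 is unbiased.
const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-"

func nanoid(size int) (string, error) {
    buf := make([]byte, size)
    if _, err := rand.Read(buf); err != nil {
        return "", err
    }
    out := make([]byte, size)
    for i, b := range buf {
        out[i] = alphabet[int(b)%64] // each byte picks one character
    }
    return string(out), nil
}

func main() {
    id, err := nanoid(21) // the conventional default length
    if err != nil {
        panic(err)
    }
    fmt.Println(id)
}
</code></pre>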
<h1 id="heading-snowflake-ids">Snowflake IDs</h1>
<p><strong>Snowflake IDs</strong> are another form of unique identifier, popularized by <strong>Twitter</strong> and used in systems like <strong>Twitter’s distributed ID generator</strong>. They are designed to be <strong>unique</strong>, <strong>sortable</strong>, and <strong>compact</strong>, and are particularly well-suited for <strong>high-performance</strong>, <strong>distributed systems</strong>.</p>
<p>A typical Snowflake ID is a <strong>64-bit</strong> integer, and the structure is usually broken down as follows (a bit-packing sketch in code follows the list):</p>
<ul>
<li><p>Timestamp (41 Bit) Millisecond timestamp since epoch</p>
</li>
<li><p>Datacenter ID (5 Bit)</p>
</li>
<li><p>Worker/Node ID (5 bit)</p>
</li>
<li><p>Sequence (12 bits) Sequence number for handling multiple IDs generated in the same millisecond</p>
</li>
<li><p>Total bits (64), which is the overall size of the Snowflake ID</p>
</li>
</ul>
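<p>The packing itself is a few shifts and ORs, sketched below in Go. The epoch constant here is Twitter’s custom epoch; a real generator would also track the sequence per millisecond and guard against clock rollback instead of taking it as a parameter.</p>
<pre><code class="lang-go">package main

import (
    "fmt"
    "time"
)

// Layout: 41 bits timestamp | 5 bits datacenter | 5 bits worker | 12 bits sequence.
const (
    epoch      = int64(1288834974657) // Twitter's custom epoch in ms
    seqBits    = 12
    workerBits = 5
    dcBits     = 5
)

func snowflake(dcID, workerID, seq int64) int64 {
    ms := time.Now().UnixMilli() - epoch
    return ms&lt;&lt;(dcBits+workerBits+seqBits) |
        dcID&lt;&lt;(workerBits+seqBits) |
        workerID&lt;&lt;seqBits |
        seq
}

func main() {
    fmt.Println(snowflake(1, 7, 0)) // sortable: later calls yield bigger IDs
}
</code></pre>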
<h3 id="heading-key-characteristics-of-snowflake-ids"><strong>Key Characteristics of Snowflake IDs</strong>:</h3>
<ol>
<li><p><strong>Time-based</strong>: The first 41 bits represent the timestamp, which makes Snowflake IDs <strong>chronologically sortable</strong>.</p>
</li>
<li><p><strong>Machine/Node-aware</strong>: Snowflake IDs include the <strong>datacenter</strong> and <strong>worker (node)</strong> IDs to avoid collisions in distributed environments.</p>
</li>
<li><p><strong>High throughput</strong>: With a <strong>12-bit sequence number</strong>, Snowflake IDs can generate up to <strong>4096 IDs per millisecond</strong> per machine, ensuring <strong>high throughput</strong> in distributed systems.</p>
</li>
<li><p><strong>Compact</strong>: The 64-bit integer format keeps the ID size compact, reducing storage and indexing overhead.</p>
</li>
<li><p><strong>Unique</strong>: The combination of timestamp, machine ID, and sequence number guarantees <strong>global uniqueness</strong>.</p>
</li>
</ol>
<h3 id="heading-best-use-cases-for-snowflake-ids"><strong>Best Use Cases for Snowflake IDs</strong></h3>
<ul>
<li><p>High-Performance Distributed Systems</p>
</li>
<li><p>Event Sourcing</p>
</li>
<li><p>Scalable Web Applications</p>
</li>
<li><p>Logging and Monitoring Systems</p>
</li>
</ul>
<p>A comparison between it and UUIDs</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td><strong>Snowflake ID</strong></td><td><strong>UUID</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Length</strong></td><td>64-bit integer (compact)</td><td>128-bit (longer and bulkier)</td></tr>
<tr>
<td><strong>Structure</strong></td><td>Time-based with machine/worker ID &amp; sequence</td><td>Completely random (UUIDv4), or timestamp-based (UUIDv1/6)</td></tr>
<tr>
<td><strong>Time-sortable</strong></td><td>Yes, naturally sortable by timestamp</td><td>Not lexicographically time-sortable (except UUIDv6/v7)</td></tr>
<tr>
<td><strong>Uniqueness</strong></td><td>Globally unique (based on machine &amp; sequence)</td><td>Globally unique (UUIDv4 random or UUIDv1 timestamp)</td></tr>
<tr>
<td><strong>Performance</strong></td><td>High throughput with up to 4096 IDs per ms</td><td>Lower throughput (especially with UUIDv1)</td></tr>
<tr>
<td><strong>Collisions</strong></td><td>Extremely low, even in distributed systems</td><td>Low (but possible with UUIDv4 in high generation rate)</td></tr>
<tr>
<td><strong>Use Case</strong></td><td>High-throughput, distributed systems</td><td>General use cases, API tokens</td></tr>
</tbody>
</table>
</div><h1 id="heading-summary">Summary</h1>
<p>In this article we went through the most popular unique identifier generation schemes, listing the use cases for each and every one. If you’re planning on adding one, I recommend understanding the main differences between them before making a decision, because choosing the wrong unique identifier scheme at a big scale could affect performance negatively. That’s been it for this one, see you in the next!</p>
<h1 id="heading-references">References</h1>
<ol>
<li><p><a target="_blank" href="https://planetscale.com/blog/the-problem-with-using-a-uuid-primary-key-in-mysql#uuidv4">https://planetscale.com/blog/the-problem-with-using-a-uuid-primary-key-in-mysql#uuidv4</a></p>
</li>
<li><p><a target="_blank" href="https://adileo.github.io/awesome-identifiers/">https://adileo.github.io/awesome-identifiers/</a></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Memory-Efficient Byte Processing: Streaming for Large Blobs]]></title><description><![CDATA[Hello folks! In this article i’m going to be talking about some tips on how to minimize memory usage (RAM) while dealing with large blobs of data. Whether it be downloading files, reading data from source and writing to destination, etc. I’ll be doin...]]></description><link>https://hewi.blog/memory-efficient-byte-processing-streaming-for-large-blobs</link><guid isPermaLink="true">https://hewi.blog/memory-efficient-byte-processing-streaming-for-large-blobs</guid><category><![CDATA[Go Language]]></category><category><![CDATA[backend]]></category><category><![CDATA[streaming]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Fri, 27 Dec 2024 15:00:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735311622213/d690554b-2ab2-406f-8e71-4b35cfd3e951.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello folks! In this article i’m going to be talking about some tips on how to minimize memory usage (RAM) while dealing with large blobs of data. Whether it be downloading files, reading data from source and writing to destination, etc. I’ll be doing a demo in Go monitoring the memory usage and talking about how streaming the data from source to destination is a better approach. Let’s get started.</p>
<h1 id="heading-naive-approach">Naive Approach</h1>
<p>Let’s say we need to download a large file in our application code and save it somewhere on disk.</p>
<p>The naive approach someone would do is the following:</p>
<pre><code class="lang-go">
<span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">downloadFile</span><span class="hljs-params">(filepath <span class="hljs-keyword">string</span>, url <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">error</span></span> {
    out, err := os.Create(filepath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> out.Close()

    resp, err := http.Get(url)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> resp.Body.Close()

    data, _ := io.ReadAll(resp.Body) <span class="hljs-comment">// Whole file in memory btw</span>
    printMemStats()

    _, err = out.Write(data)
    <span class="hljs-keyword">return</span> err
}
</code></pre>
<p>This snippet does the following:</p>
<ol>
<li><p>Creates a new file for the &lt;to be downloaded file&gt;</p>
</li>
<li><p>Downloads the file</p>
</li>
<li><p>Copies the bytes from the downloaded blob to the created file</p>
</li>
</ol>
<p>I have the function <code>printMemStats</code> that’ll give us info on memory usage</p>
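<p>The helper itself isn’t shown in the snippet above; a plausible implementation (assuming only the standard library’s <code>runtime</code> and <code>fmt</code> packages are imported) would look like this:</p>
<pre><code class="lang-go">// printMemStats dumps the runtime's memory counters in MB.
func printMemStats() {
    var m runtime.MemStats
    runtime.ReadMemStats(&amp;m)
    toMB := func(b uint64) float64 { return float64(b) / 1024 / 1024 }
    fmt.Printf("Alloc: %.2f MB (currently in use)\n", toMB(m.Alloc))
    fmt.Printf("TotalAlloc: %.2f MB (allocated since start)\n", toMB(m.TotalAlloc))
    fmt.Printf("Sys: %.2f MB (reserved from the OS)\n", toMB(m.Sys))
    fmt.Printf("HeapAlloc: %.2f MB (heap in use)\n", toMB(m.HeapAlloc))
    fmt.Printf("HeapSys: %.2f MB (heap reserved from the OS)\n", toMB(m.HeapSys))
    fmt.Printf("NumGC: %d (garbage collections)\n", m.NumGC)
}
</code></pre>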
<p>Running this on a 100MB file and monitoring memory, we can deduce the following:</p>
<pre><code class="lang-go">Alloc: <span class="hljs-number">117.94</span> MB (currently in use)
TotalAlloc: <span class="hljs-number">587.91</span> MB (total allocated memory) (This is the total amount of memory that has been allocated by the program since it started. It includes both the memory that is still in use and the memory that has been released by the garbage collector.)
Sys: <span class="hljs-number">253.92</span> MB (total memory reserved from the OS)
HeapAlloc: <span class="hljs-number">117.94</span> MB (heap memory in use)
HeapSys: <span class="hljs-number">247.11</span> MB (total heap memory reserved from OS to the app)
NumGC: <span class="hljs-number">21</span> (number of garbage collections)
</code></pre>
<p>Looking at the allocated memory, about 118MB was in use, which makes sense: the downloaded file alone was 100MB, plus some extra memory required by the Go runtime.</p>
<p>Now imagine this file being 1 GB instead. Having a single process in the Go app hogging 1 GB of memory is very bad practice. We can do better, so let’s discover the art of <strong>streaming data</strong>.</p>
<h1 id="heading-streaming-data">Streaming Data</h1>
<p>So the idea simplified is instead of having the flow look like this</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735309609120/0d367a5c-7355-463d-a55a-c10f3de17919.png" alt class="image--center mx-auto" /></p>
<p>How about we get rid of the red part and go straight to <strong>writing to the file!</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735309723714/6f8c52ef-4582-4f19-95f6-11d27a8fbc6c.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>The write buffer here isn’t in Go itself: <code>out.Write(data)</code> hands the bytes to the operating system, which buffers them in its page cache before flushing to disk, reducing the number of I/O operations, which are very expensive.</p>
</blockquote>
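<p>If you do want explicit buffering in user space, you can wrap the file in a <code>bufio.Writer</code>. A minimal sketch (my addition, not part of the original demo):</p>
<pre><code class="lang-go">import (
    "bufio"
    "os"
)

// writeBuffered accumulates small writes in bufio's internal buffer
// (4096 bytes by default) and flushes them to the file in larger
// chunks, reducing the number of write syscalls.
func writeBuffered(f *os.File, data []byte) error {
    w := bufio.NewWriter(f)
    if _, err := w.Write(data); err != nil {
        return err
    }
    return w.Flush() // don't forget to flush the remaining bytes
}
</code></pre>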
<p>What I love the most about this is that it’s just plain creativity! Instead of downloading the whole blob and then transferring it, we skip the intermediate step entirely: as the bytes come in, we write them straight to the file, with minimal memory overhead. Let’s translate this to our code and check the memory stats after updating.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">downloadFile</span><span class="hljs-params">(filepath <span class="hljs-keyword">string</span>, url <span class="hljs-keyword">string</span>)</span> <span class="hljs-title">error</span></span> {
    out, err := os.Create(filepath)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> out.Close()

    resp, err := http.Get(url)
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> err
    }
    <span class="hljs-keyword">defer</span> resp.Body.Close()
    _, err = io.Copy(out, resp.Body) <span class="hljs-comment">// STREAM THE BODY TO FILE (and capture the copy error)</span>

    printMemStats()

    <span class="hljs-keyword">return</span> err
}
</code></pre>
<p>Now <code>io.Copy</code> internally buffers between source and destination: by default it allocates a 32 KB buffer, unless the source implements <code>io.WriterTo</code> or the destination implements <code>io.ReaderFrom</code>, in which case the copy is delegated to them.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735311043584/3181a5aa-2552-444c-9e75-b4e6cee2cfc6.png" alt class="image--center mx-auto" /></p>
<p>Running the code, the memory stats are as follows:</p>
<pre><code class="lang-go">Alloc: <span class="hljs-number">0.549</span> MB
TotalAlloc: <span class="hljs-number">0.549</span> MB
Sys: <span class="hljs-number">8.209</span> MB
HeapAlloc: <span class="hljs-number">0.549</span> MB
HeapSys: <span class="hljs-number">3.776</span> MB
</code></pre>
<p>Almost a 99% decrease in memory! And it isn’t just memory efficient, it’s faster as well, because we skipped a whole step.</p>
<p>This technique isn’t limited to files or downloads. It also applies to API design inside your program: instead of passing a large byte slice around, forcing the whole blob to sit in memory at once, accept an <code>io.Reader</code> and process the data as it streams from source to destination.</p>
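<p>For example, a function that needs to checksum a large blob can accept an <code>io.Reader</code> instead of a <code>[]byte</code>, so it never holds the whole blob at once (a minimal sketch):</p>
<pre><code class="lang-go">import (
    "crypto/sha256"
    "encoding/hex"
    "io"
)

// checksum hashes whatever the reader produces, chunk by chunk,
// without ever materializing the whole blob in memory.
func checksum(r io.Reader) (string, error) {
    h := sha256.New()
    if _, err := io.Copy(h, r); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

// usage: sum, err := checksum(resp.Body)
</code></pre>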
<h1 id="heading-summary">Summary</h1>
<p>Most modern open source applications use tricks like these to optimize for memory and performance, skipping buffers and unnecessary overhead wherever they can. It’s the deep understanding of what goes on behind the scenes that opens the door to optimization. And when you visualize the data flow, the right approach becomes much clearer.</p>
]]></content:encoded></item><item><title><![CDATA[A backend engineer lost in the DevOps world - Authentication and Authorization in Kubernetes]]></title><description><![CDATA[Introduction
Hello folks! In this one we’re going all in on authorization and authentication in Kubernetes. Whenever you get access to a Kubernetes cluster in your job do you ever wonder what happens behinds the scenes? The DevOps guy just sends you ...]]></description><link>https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-authentication-and-authorization-in-kubernetes</link><guid isPermaLink="true">https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-authentication-and-authorization-in-kubernetes</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[backend]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 21 Dec 2024 14:05:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734789910119/73163c69-4169-400b-9bc1-8b37f023ef91.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hello folks! In this one we’re going all in on authorization and authentication in Kubernetes. Whenever you get access to a Kubernetes cluster at your job, do you ever wonder what happens behind the scenes? The DevOps guy just sends you a kubeconfig YAML file and you start using it. We’ll be going through the basics of authentication and authorization in Kubernetes, covering 3 main parts:</p>
<ol>
<li><p>Kubernetes Authentication Workflow</p>
</li>
<li><p>Authentication Methods in Kubernetes</p>
</li>
<li><p>User vs. Pod Authentication</p>
</li>
</ol>
<h1 id="heading-authentication-workflow">Authentication Workflow</h1>
<p>The Kubernetes authentication workflow serves as the first line of defense, verifying "who" is making the request to the Kubernetes API server, which is the main entry point for all Kubernetes operations. Every request to the cluster, whether it’s from a <code>kubectl</code> command, a CI/CD pipeline, or an application, must pass through the API server.</p>
<p>Any incoming request first passes through the <strong>Identity Assertion</strong> phase. The request usually carries one of the following:</p>
<ul>
<li><p><strong>Client Certificates</strong>: TLS certificates that the API server can validate.</p>
</li>
<li><p><strong>Bearer Tokens</strong>: Passed in the <code>Authorization</code> header of the HTTP request.</p>
</li>
<li><p><strong>Custom Mechanisms</strong>: OpenID Connect (OIDC), cloud IAM integrations (e.g., AWS IAM), etc.</p>
</li>
</ul>
<p>Kubernetes supports a pluggable authentication mechanism. The API server runs the configured authentication plugins one after another, for example:</p>
<ul>
<li><p>Static Token File.</p>
</li>
<li><p>Client Certificate Authentication.</p>
</li>
<li><p>Webhook Token Authentication.</p>
</li>
<li><p>OpenID Connect (OIDC).</p>
</li>
<li><p>HTTP Proxy Authentication.</p>
</li>
</ul>
<p>If a plugin authenticates the request, the process stops: the identity is assigned to the request, and it moves on to the authorization phase. If no plugin succeeds, the API server returns a 401 Unauthorized.</p>
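<p>Conceptually, the chain behaves like the following Go sketch (illustrative only, not the actual kube-apiserver code; the interface here is made up for the example):</p>
<pre><code class="lang-go">import (
    "errors"
    "net/http"
)

// Authenticator is one pluggable authentication method.
type Authenticator interface {
    // Authenticate returns the identity and true on success,
    // or false to let the next plugin try.
    Authenticate(r *http.Request) (identity string, ok bool)
}

// authenticate runs the configured plugins one by one; the first
// success wins, otherwise the request is rejected with a 401.
func authenticate(r *http.Request, plugins []Authenticator) (string, error) {
    for _, p := range plugins {
        if id, ok := p.Authenticate(r); ok {
            return id, nil // identity assigned; move on to authorization
        }
    }
    return "", errors.New("401 Unauthorized: no plugin authenticated the request")
}
</code></pre>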
<p>Once authentication confirms the identity, the request moves to the <strong>authorization</strong> phase. In this phase:</p>
<ul>
<li><p>Kubernetes evaluates whether the authenticated user has permission to perform the requested action on the specified resource.</p>
</li>
<li><p>Authorization mechanisms include for example Role-Based Access Control (RBAC)</p>
</li>
</ul>
<p>Once authorized, the request is processed and the response is returned to the caller.</p>
<h1 id="heading-authentication-methods-in-kubernetes">Authentication Methods in Kubernetes</h1>
<p>As mentioned above there exist different authentication methods in Kubernetes, these methods cater to both user (human) and machine authentication. Here’s an overview of the main methods:</p>
<ul>
<li><p><strong>Static Token File</strong>, where tokens are stored in a static file provided to the Kubernetes API server. These tokens never change unless rotated manually. Not ideal for production environments, as a token could get stolen by an attacker and used for malicious purposes.</p>
</li>
<li><p><strong>Service Account Tokens</strong>, where tokens are automatically generated and mounted inside pods (historically stored in Kubernetes Secrets). These tokens are tied to Kubernetes Service Accounts, which carry specific permissions, making them ideal for pod-to-API-server communication, especially for applications running inside the cluster. Before Kubernetes 1.21 they used to be static; since then they are renewable and bound to stricter audiences and permissions.</p>
</li>
<li><p><strong>Client Certificate Authentication</strong> uses TLS certificates to verify identities, ensuring that only trusted entities can access the Kubernetes API server. Can be used to authenticate humans alone or with service accounts together. A certificate authority generates and signs a user’s certificate and validates against it.</p>
</li>
<li><p><strong>Custom mechanisms</strong>, including OpenID Connect, where Kubernetes delegates authentication to an external OIDC provider instead of handling it natively.</p>
</li>
</ul>
<h1 id="heading-user-vs-pod-authentication">User vs Pod Authentication</h1>
<p>Kubernetes provides distinct mechanisms to authenticate <strong>users</strong> (human operators or external tools) and <strong>pods</strong> (workloads running within the cluster).</p>
<h2 id="heading-user-authentication">User Authentication</h2>
<p>User authentication refers to how human users or external tools (e.g., CI/CD pipelines) authenticate with the Kubernetes API server to perform actions like managing resources. Methods include:</p>
<ol>
<li><p>Certificate Authentication</p>
</li>
<li><p>Static Tokens (Not recommended)</p>
</li>
<li><p>OpenID Connect (OIDC)</p>
</li>
</ol>
<p>Kubernetes doesn’t manage users directly; external systems (e.g., certificates, identity providers) are required.</p>
<h2 id="heading-pod-authentication">Pod Authentication</h2>
<p>Pod authentication refers to how workloads running inside the cluster (e.g., pods) authenticate with the Kubernetes API server to perform actions like reading secrets or interacting with resources.</p>
<p>The main way to do this is by service accounts, where each pod is automatically assigned a service account and the tokens get mounted into the pod.</p>
<p>These tokens are used by applications within the pod to authenticate with the API server.</p>
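<p>To show the mechanics, here’s a rough Go sketch of an in-pod client attaching the mounted token as a bearer token (in practice, client-go’s <code>rest.InClusterConfig()</code> does all of this for you):</p>
<pre><code class="lang-go">import (
    "net/http"
    "os"
)

// The service account token is mounted at this path in every pod by default.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// newAPIServerRequest builds a request to the API server authenticated
// with the pod's service account token. (A real client would also load
// the mounted CA certificate for TLS verification.)
func newAPIServerRequest(url string) (*http.Request, error) {
    token, err := os.ReadFile(tokenPath)
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+string(token))
    return req, nil
}
</code></pre>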
<p>The benefit of this is that service accounts are namespace-scoped and tightly controlled.</p>
<h1 id="heading-summary">Summary</h1>
<p>There you have it! A brief introduction to authentication and authorization in Kubernetes. I think as a backend developer it’s important to understand these concepts and appreciate how dynamic and flexible Kubernetes made them. The designs can carry over to your day-to-day work.</p>
<h1 id="heading-whats-next">What’s Next?</h1>
<p>I’ll be doing a demo where we imagine someone new has joined the company and needs minimal access to the cluster. I’ll walk through the basic steps to set that up. See you there!</p>
]]></content:encoded></item><item><title><![CDATA[A backend engineer lost in the DevOps world - Auto Scaling In Kubernetes]]></title><description><![CDATA[Introduction
Hello folks and welcome to the second part of the series I made where I discover DevOps concepts that I wanted to understand as a backend engineer. In this one we dive into Kubernetes AutoScaling where we’ll be going through its basics a...]]></description><link>https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-auto-scaling-in-kubernetes</link><guid isPermaLink="true">https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-auto-scaling-in-kubernetes</guid><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[scalability]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[backend]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Thu, 28 Nov 2024 19:28:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732822048093/c7624399-bd96-40b4-b985-3ebcc0a4df9e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hello folks and welcome to the second part of the series I made where I discover DevOps concepts that I wanted to understand as a backend engineer. In this one we dive into Kubernetes AutoScaling where we’ll be going through its basics and testing it to make sure we understand everything going on. Let’s start!</p>
<h1 id="heading-autoscaler-basics">AutoScaler Basics</h1>
<h2 id="heading-metrics-server">Metrics Server</h2>
<p>An autoscaler scales automatically when certain metrics reach an agreed-upon threshold. Simple, right? There’s a lot more to it, though.</p>
<p>The autoscaler relies on a metrics source in order to actually watch for metric changes. Kubernetes uses a component called the <strong>Metrics Server</strong> to collect resource metrics (like CPU and memory usage) for <strong>pods</strong> and <strong>nodes</strong> in a cluster. The Metrics Server aggregates these metrics and makes them available to components like the <strong>Horizontal Pod Autoscaler (HPA)</strong> and other monitoring tools.</p>
<ul>
<li><p>The <strong>Metrics Server</strong> is a lightweight, cluster-wide aggregator of resource usage data (like CPU and memory) for nodes and pods.</p>
</li>
<li><p>It <strong>does not</strong> store historical data — it only provides the current resource usage (live metrics).</p>
</li>
<li><p>The Metrics Server collects data from the <strong>kubelet</strong> (the primary node agent that runs on each node).</p>
</li>
</ul>
<p>The kubelet exposes the metrics of the node’s containers/pods on port <code>10250</code> at the <code>/metrics/resource</code> endpoint.</p>
<p>In some managed Kubernetes environments (e.g., GKE) the Metrics Server comes preinstalled; in others, and in local tools like kind or Minikube, it isn’t installed by default. To install it in your local cluster:</p>
<p><code>kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml</code></p>
<p>And verify it’s running using this command:</p>
<p><code>kubectl get deployment metrics-server -n kube-system</code></p>
<blockquote>
<p>the command <code>kubectl top pods</code> is only available to use once the metrics server is installed and running</p>
</blockquote>
<p>Now we have a Metrics Server pulling metrics from every node’s kubelet. We need to do something with this information, and yep, you probably guessed it: autoscaling!</p>
<h2 id="heading-nginx-deployment-example">Nginx Deployment (Example)</h2>
<p>Before moving on to the actual autoscaling, our example will include a simple nginx deployment where we will monitor CPU usage and add an autoscaler to it. We will run an infinite while loop sending requests to the nginx server and watch the autoscaling happen.</p>
<p>This is the deployment/service manifest file for nginx</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-deployment</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">nginx</span>
        <span class="hljs-attr">image:</span> <span class="hljs-string">nginx:latest</span>
        <span class="hljs-attr">resources:</span>
          <span class="hljs-attr">requests:</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"100m"</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"128Mi"</span>
          <span class="hljs-attr">limits:</span>
            <span class="hljs-attr">cpu:</span> <span class="hljs-string">"200m"</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">"256Mi"</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-service</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">80</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">80</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<p>Each pod has a request of <code>100 millicores CPU (0.1 Core)</code> and <code>128 Mebibytes</code> and a limit of <code>0.2 Core &amp; 256 Mebibytes</code></p>
<blockquote>
<p>Mebibytes are binary units (unlike megabytes, which are decimal). In computing, memory is inherently binary (base-2); for example, RAM sizes are measured in powers of 2 (e.g., 512 MiB, 1 GiB). <strong>Megabytes</strong> can cause confusion because their decimal nature doesn’t match binary-based memory calculations. 1 Mebibyte = 1024² bytes.</p>
</blockquote>
<h2 id="heading-hpa-manifest">HPA Manifest</h2>
<p>The basic autoscaling manifest looks something like this (In this context we have an nginx deployment where we’re going to apply HPA to it)</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">autoscaling/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">HorizontalPodAutoscaler</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-hpa</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">scaleTargetRef:</span>
    <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-deployment</span>
  <span class="hljs-attr">minReplicas:</span> <span class="hljs-number">2</span>
  <span class="hljs-attr">maxReplicas:</span> <span class="hljs-number">5</span>
  <span class="hljs-attr">metrics:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">Resource</span>
      <span class="hljs-attr">resource:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">cpu</span>
        <span class="hljs-attr">target:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">Utilization</span>
          <span class="hljs-attr">averageUtilization:</span> <span class="hljs-number">20</span>
</code></pre>
<p>When the <strong>average CPU utilization of all currently running pods</strong> of the nginx deployment exceeds 20% (just an example, not a practical value), autoscaling starts.</p>
<p>Let’s say we have 2 replicas already running where both have a cpu utilization of 25%</p>
<p>Thus the average cpu utilization is:</p>
<p>(<code>utilization of first pod + utilization of second pod)/ 2 (pod count)</code> = <code>25%</code></p>
<p>To figure out how many more replicas we need so that the average utilization drops back under 20%, we look at the ratio between the actual and target utilizations: <code>25/20 = 1.25</code></p>
<p>Meaning, the <strong>actual utilization is 1.25× the target utilization.</strong></p>
<p>By multiplying this scaling factor (1.25) by the <strong>current number of replicas</strong>, we determine how many replicas are needed to bring CPU utilization down to the target. In our case that’s 2 × 1.25 = 2.5, and since we can’t create a fraction of a pod we <strong>ceil</strong> the result, so after scaling up we should have 3 replicas instead of 2.</p>
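<p>This is the standard HPA formula: <code>desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization)</code>. As a tiny Go sketch of the arithmetic:</p>
<pre><code class="lang-go">import "math"

// desiredReplicas implements the HPA scaling formula:
// ceil(currentReplicas * currentUtilization / targetUtilization).
func desiredReplicas(current int, currentUtil, targetUtil float64) int {
    return int(math.Ceil(float64(current) * currentUtil / targetUtil))
}

// desiredReplicas(2, 25, 20) == 3: scale from 2 replicas to 3.
</code></pre>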
<p>The HPA manifest above is a very simple configuration. There are many more options, for example around scaling down (how long to wait before scaling down again), but for the sake of this article we’ll keep it simple.</p>
<p>Upon applying the above manifests. We should have HPA installed to our nginx deployment.</p>
<h2 id="heading-testing">Testing</h2>
<p>We can test using the <code>busybox</code> image, which gives us a shell where we can run a while loop sending requests to the nginx web server.</p>
<p><code>kubectl run busybox --image=busybox --rm -it -- /bin/sh</code></p>
<p>Then inside the shell</p>
<p><code>while true; do wget -q -O- http://nginx-service; done</code></p>
<p>If we execute <code>kubectl get hpa</code> we can find a <code>TARGETS</code> column showing <code>x%/20%</code>, which means the current utilization over the target utilization specified in the HPA manifest.</p>
<p>We can monitor and check that once the current passes the threshold new replicas are created according to the equation mentioned above!</p>
<h2 id="heading-auto-scaling-down">Auto Scaling Down</h2>
<p>When a Kubernetes HPA is configured, it monitors certain metrics (like CPU or memory utilization) at regular intervals (every 15 seconds by default). If the current resource usage falls below the defined target utilization threshold, the HPA will scale down the number of pods.</p>
<h4 id="heading-scaling-down-criteria"><strong>Scaling Down Criteria:</strong></h4>
<ul>
<li><p><strong>Target Utilization vs. Current Resource Usage:</strong> if the current usage is below the target, the HPA will scale down</p>
</li>
<li><p><strong>MinReplicas</strong>: the HPA never scales below this number</p>
</li>
<li><p><strong>Cooldown Period (Stabilization Window):</strong> Kubernetes doesn’t immediately scale down when resource usage decreases slightly. It has a built-in <strong>stabilization period</strong> to avoid <strong>flapping</strong>, which is when scaling occurs rapidly back and forth.</p>
</li>
</ul>
<p>The HPA manifest has a <code>behavior</code> section which allows you to specify custom scaling behavior, including how quickly to scale down.</p>
<p>In our HPA manifest we can update it as follows:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">autoscaling/v2</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">HorizontalPodAutoscaler</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-hpa</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">scaleTargetRef:</span>
    <span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">nginx-deployment</span>
  <span class="hljs-attr">minReplicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">maxReplicas:</span> <span class="hljs-number">5</span>
  <span class="hljs-attr">metrics:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">Resource</span>
      <span class="hljs-attr">resource:</span>
        <span class="hljs-attr">name:</span> <span class="hljs-string">cpu</span>
        <span class="hljs-attr">target:</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">Utilization</span>
          <span class="hljs-attr">averageUtilization:</span> <span class="hljs-number">20</span>
  <span class="hljs-attr">behavior:</span>
    <span class="hljs-attr">scaleDown:</span>
      <span class="hljs-attr">policies:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">Percent</span>
          <span class="hljs-attr">value:</span> <span class="hljs-number">20</span>
          <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">60</span>
<span class="hljs-comment"># Scale down by 20% every 60 seconds</span>
</code></pre>
<p>The behavior here is: scale down by at most 20% of the current replicas every 60 seconds. The default <code>stabilizationWindowSeconds</code> for scale-down is 300 seconds (5 minutes), but it can be configured too.</p>
<h1 id="heading-summary">Summary</h1>
<p>In this article we took a look at how HPA works in Kubernetes based on metrics such as CPU and memory: we saw how these metrics are collected via the Metrics Server and how they are used to scale up and down. I’ll make a part two where we autoscale based on <strong>custom</strong> metrics, such as response-time percentiles and load. That will require extra work, but it’s worth it. See you in the next one!</p>
]]></content:encoded></item><item><title><![CDATA[A backend engineer lost in the DevOps world - Making a Kubernetes Operator with Go]]></title><description><![CDATA[Introduction
Hello folks! This will be a series of articles where I try diving into complex Devops topics simplifying them for us backend engineers and making the article serve as a quick recap for whoever is interested. In this article we’ll dive in...]]></description><link>https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-making-a-kubernetes-operator-with-go</link><guid isPermaLink="true">https://hewi.blog/a-backend-engineer-lost-in-the-devops-world-making-a-kubernetes-operator-with-go</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Go Language]]></category><category><![CDATA[Devops]]></category><category><![CDATA[backend]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Fri, 22 Nov 2024 13:21:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732281621838/45f42a78-cd0f-437c-a24b-ed8e5d4b3dca.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>Hello folks! This will be a series of articles where I dive into complex DevOps topics, simplifying them for us backend engineers, so that each article can serve as a quick recap for whoever is interested. In this article we’ll dive into Kubernetes operators, and we’ll build our first operator using Go (not from scratch; we’ll use a handy tool called <code>kubebuilder</code> that generates the boilerplate so we can focus on what actually matters). Before moving forward, let’s talk about what operators even are.</p>
<h1 id="heading-kubernetes-operators">Kubernetes Operators</h1>
<p>A <strong>Kubernetes Operator</strong> is a method of automating the management of complex applications on Kubernetes, a typical operator consists of:</p>
<ol>
<li><p><strong>Controller</strong>: A program that watches and reacts to changes in Kubernetes resources (like Custom Resources) and takes actions to manage the application’s lifecycle (e.g., creating, updating, or deleting resources).</p>
</li>
<li><p><strong>Custom Resource (CR)</strong>: A custom-defined object that represents the application or service the operator manages. It defines the desired state (e.g., number of replicas, configuration) of that application.</p>
</li>
</ol>
<p>It automates the management of complex applications by using a controller to watch a custom resource and ensure the application matches the desired state.</p>
<p>Now, if you were like me the first time I read this, you probably didn’t understand much of that. It’s time to simplify this even further.</p>
<p>The operator is an umbrella for 2 main things:</p>
<ol>
<li><p>A custom resource, which is a new type of object that Kubernetes doesn’t natively know about (for example, <strong>pod</strong> is a built-in resource, while <strong>pod-stalker</strong> would be a custom resource because it isn’t natively part of Kubernetes).</p>
<p> So we create new custom resources that have a defined <strong>schema</strong>. This will become much clearer when we implement the actual operator.</p>
</li>
<li><p>The controller, which is the brain of the operator and where the main code lives. The controller watches for changes in the custom resource and runs logic based on what happened. This process is called <strong>reconciliation</strong>, where the goal is always to get from the current state back to the desired state. The schema of the custom resource is defined in a <strong>Custom Resource Definition (CRD)</strong>, and the desired state itself is expressed in a YAML manifest that instantiates the custom resource by filling in that schema.</p>
</li>
</ol>
<p>Enough with the theory, let’s walk through a cool project. We will create a custom resource <code>PodTracker</code> that watches pods with a specified name in the <code>default</code> namespace and sends a message to a Slack channel whenever such a pod is created.</p>
<h1 id="heading-podtracker-walkthrough">PodTracker Walkthrough</h1>
<p>To get started first install <a target="_blank" href="https://book.kubebuilder.io/quick-start.html"><code>kubebuilder</code></a></p>
<p><strong>Kubebuilder</strong> is a framework for building Kubernetes Operators and Custom Controllers. It provides a set of tools and libraries to help you easily create, test, and manage Kubernetes Operators, which automate the management of complex applications on Kubernetes.</p>
<p>This will give us a great scaffold to start off of.</p>
<p>Once kubebuilder is installed let’s run the commands to create a scaffold</p>
<pre><code class="lang-bash"><span class="hljs-comment"># initialize a new kubebuilder project in Go. </span>
kubebuilder init --domain lost.backend --repo lost.backend/pod-tracker
<span class="hljs-comment"># creates the API responsible for our new Custom Resource</span>
kubebuilder create api --group pod-tracker --version v1 --kind PodTracker
</code></pre>
<p>Once installed the file structure should look something like this</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732276164301/4fef84b3-00df-4a56-843e-5635e3f5a2d6.png" alt class="image--center mx-auto" /></p>
<p>We’re only going to be concerned with two main files</p>
<ul>
<li><p><code>internal/controller/podtracker_controller.go</code></p>
</li>
<li><p><code>api/v1/podtracker_types.go</code></p>
</li>
</ul>
<p>Let’s start by defining our Custom Resource Schema first</p>
<h2 id="heading-custom-resource-schema">Custom Resource Schema</h2>
<p>Inside <code>api/v1/podtracker_types.go</code></p>
<p>You’ll find a structure that looks like the following</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> PodTrackerSpec <span class="hljs-keyword">struct</span> {
}

<span class="hljs-comment">// PodTrackerStatus defines the observed state of PodTracker.</span>
<span class="hljs-keyword">type</span> PodTrackerStatus <span class="hljs-keyword">struct</span> {

}

<span class="hljs-comment">// +kubebuilder:object:root=true</span>
<span class="hljs-comment">// +kubebuilder:subresource:status</span>

<span class="hljs-comment">// PodTracker is the Schema for the podtrackers API.</span>
<span class="hljs-keyword">type</span> PodTracker <span class="hljs-keyword">struct</span> {
    metav1.TypeMeta   <span class="hljs-string">`json:",inline"`</span>
    metav1.ObjectMeta <span class="hljs-string">`json:"metadata,omitempty"`</span>

    Spec   PodTrackerSpec   <span class="hljs-string">`json:"spec,omitempty"`</span>
    Status PodTrackerStatus <span class="hljs-string">`json:"status,omitempty"`</span>
}
</code></pre>
<p><code>PodTrackerSpec</code> is for the desired schema of the pod tracker. We will have the following:</p>
<ul>
<li><p>A <code>Name</code> field ensuring this pod tracker tracks pods with the given name (for simplicity; pod names are usually unique, so a better approach would be to reference a deployment name and monitor pods that belong to that specific deployment)</p>
</li>
<li><p>A <code>Reporter</code> struct that contains <code>Kind, Key &amp; Channel</code>, describing the kind of reporting (in our case Slack), its API key, and the channel to post to</p>
</li>
</ul>
<p><code>PodTrackerStatus</code> defines the current observed state of the PodTracker, which is usually updated during reconciliation (the controller updates it according to events that occur). Let’s leave this empty for now.</p>
<p>Our types file should now look like this:</p>
<pre><code class="lang-go"><span class="hljs-keyword">type</span> Reporter <span class="hljs-keyword">struct</span> {
    Kind    <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"kind,omitempty"`</span>
    Key     <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"key,omitempty"`</span>
    Channel <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"channel,omitempty"`</span>
}
<span class="hljs-keyword">type</span> PodTrackerSpec <span class="hljs-keyword">struct</span> {
    Name     <span class="hljs-keyword">string</span>   <span class="hljs-string">`json:"name,omitempty"`</span>
    Reporter Reporter <span class="hljs-string">`json:"reporter,omitempty"`</span>
}

<span class="hljs-comment">// PodTrackerStatus defines the observed state of PodTracker.</span>
<span class="hljs-keyword">type</span> PodTrackerStatus <span class="hljs-keyword">struct</span> {
    PodCount <span class="hljs-keyword">int</span>    <span class="hljs-string">`json:"podCount,omitempty"`</span>
    Status   <span class="hljs-keyword">string</span> <span class="hljs-string">`json:"status,omitempty"`</span>
}
</code></pre>
<p><strong>NOTE: it’s essential to add the JSON tags, otherwise the fields won’t serialize properly in the generated manifests.</strong></p>
<p>Now that we’ve defined our custom resource, it’s time to actually write an instance of it. Something like this:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># tracker.yaml</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">"pod-tracker.lost.backend/v1"</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">"PodTracker"</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pod-tracker</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">"nginx"</span>
  <span class="hljs-attr">reporter:</span>
    <span class="hljs-attr">kind:</span> <span class="hljs-string">"slack"</span>
    <span class="hljs-attr">key:</span> <span class="hljs-string">"slack-api-key (will post link on how to)"</span>
    <span class="hljs-attr">channel:</span> <span class="hljs-string">"C0821GM4602 (channel ID)"</span>
</code></pre>
<p>This is a manifest where we can use <code>kubectl apply -f tracker.yaml</code> to apply this manifest and have our first pod-tracker object running!</p>
<p>Before doing <code>kubectl apply</code> we actually need to install the custom resource created in a local Kubernetes cluster. Make sure you have one running using <a target="_blank" href="https://kind.sigs.k8s.io"><mark>kind</mark></a> for example.</p>
<p>We can install the Custom Resource by executing <code>make install</code> inside the project directory.</p>
<p>If we execute the <code>kubectl apply</code> command above we’ll find out that we have a pod-tracker resource instance already up and running</p>
<p>Check by <code>kubectl get podtracker</code></p>
<p>However, it’s just a resource instance; it doesn’t do anything useful (for now). Now it’s time to add the brain to this resource using the second important file we mentioned: our controller file!</p>
<h2 id="heading-custom-controller">Custom Controller</h2>
<p>In <code>internal/controller/podtracker_controller.go</code> we should have a method called <code>Reconcile</code> that looks as follows:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(r *PodTrackerReconciler)</span> <span class="hljs-title">Reconcile</span><span class="hljs-params">(ctx context.Context, req ctrl.Request)</span> <span class="hljs-params">(ctrl.Result, error)</span></span> {

    <span class="hljs-keyword">return</span> ctrl.Result{}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p><code>Reconcile</code> takes in what is known as a reconciliation request (basically, a request to trigger this method) and executes the logic inside to reconcile the custom resource back to the desired state.</p>
<p>It automatically gets called when events happen on <code>PodTracker</code> Resource (Creating, updating, deleting, etc)</p>
<p>What triggers the reconcile method is controlled by the second method we have here:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(r *PodTrackerReconciler)</span> <span class="hljs-title">SetupWithManager</span><span class="hljs-params">(mgr ctrl.Manager)</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-keyword">return</span> ctrl.NewControllerManagedBy(mgr).
        For(&amp;podtrackerv1.PodTracker{}). <span class="hljs-comment">// Your primary resource</span>
        Watches(&amp;corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(r.HandlePodEvents)).
        WithEventFilter(predicate.Funcs{
            CreateFunc: <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(e event.CreateEvent)</span> <span class="hljs-title">bool</span></span> {
                <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span> <span class="hljs-comment">// Process only create events</span>
            },
            UpdateFunc: <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(e event.UpdateEvent)</span> <span class="hljs-title">bool</span></span> {
                <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span> <span class="hljs-comment">// Ignore updates</span>
            },
            DeleteFunc: <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(e event.DeleteEvent)</span> <span class="hljs-title">bool</span></span> {
                <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span> <span class="hljs-comment">// Ignore deletions</span>
            },
            GenericFunc: <span class="hljs-function"><span class="hljs-keyword">func</span><span class="hljs-params">(e event.GenericEvent)</span> <span class="hljs-title">bool</span></span> {
                <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span> <span class="hljs-comment">// Ignore generic events</span>
            },
        }).
        Complete(r)
}
</code></pre>
<p>In this method we basically watch for changes both in the PodTracker and Pod resources. Only create events are allowed to get processed and we discard the rest.</p>
<p>We pass the pod events to a method, <code>r.HandlePodEvents</code>, which finds the matching PodTracker object and enqueues a reconciliation request for it. The reconciler in turn does its job and sends a message to Slack saying that a new pod has been created.</p>
<p>This is what <code>HandlePodEvents</code> looks like:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(r *PodTrackerReconciler)</span> <span class="hljs-title">HandlePodEvents</span><span class="hljs-params">(ctx context.Context, o client.Object)</span> []<span class="hljs-title">ctrl</span>.<span class="hljs-title">Request</span></span> {
    <span class="hljs-comment">// Check if the object is a pod if not ignore</span>
    pod, ok := o.(*corev1.Pod)
    <span class="hljs-keyword">if</span> !ok {
        <span class="hljs-keyword">return</span> []ctrl.Request{}
    }
    <span class="hljs-comment">// check if the object lies in the kubernetes default namespace otherwise ignore</span>
    <span class="hljs-keyword">if</span> pod.Namespace != <span class="hljs-string">"default"</span> {
        <span class="hljs-keyword">return</span> []ctrl.Request{}
    }

    <span class="hljs-comment">// get the list of PodTracker objects</span>
    podTrackerList := &amp;podtrackerv1.PodTrackerList{}
    <span class="hljs-comment">// if none are found ignore.</span>
    <span class="hljs-keyword">if</span> err := r.List(ctx, podTrackerList); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> []ctrl.Request{}
    }

    ctrlRequests := []ctrl.Request{}
    <span class="hljs-comment">// iterate over the list of PodTracker objects</span>
    <span class="hljs-keyword">for</span> _, podTracker := <span class="hljs-keyword">range</span> podTrackerList.Items {
        <span class="hljs-comment">// check if the PodTracker object is watching the pod</span>
        <span class="hljs-keyword">if</span> podTracker.Spec.Name == pod.Name {
            ctrlRequests = <span class="hljs-built_in">append</span>(ctrlRequests, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(&amp;podTracker)})
        }
    }

    <span class="hljs-keyword">return</span> ctrlRequests
}
</code></pre>
<p>We simply just get the podTracker objects and check which one of them is responsible for managing the currently created pod. When we find it we enqueue a request to reconcile that specific podTracker using <code>NamespacedName</code> which is an object useful for passing to the reconciliation request. It contains the resource name and namespace.</p>
<p><code>ctrlRequests</code> can contain multiple reconciliation requests, in which case the <code>Reconcile</code> method is invoked once per request.</p>
<p>Now in the main reconciliation method I added this.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(r *PodTrackerReconciler)</span> <span class="hljs-title">Reconcile</span><span class="hljs-params">(ctx context.Context, req ctrl.Request)</span> <span class="hljs-params">(ctrl.Result, error)</span></span> {
    fmt.Println(<span class="hljs-string">"Reconciling PodTracker"</span>)
    podTracker := &amp;podtrackerv1.PodTracker{}
    <span class="hljs-keyword">if</span> err := r.Get(ctx, req.NamespacedName, podTracker); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> ctrl.Result{}, client.IgnoreNotFound(err)
    }

    <span class="hljs-comment">// send to slack an update that a pod with the name was created</span>
    lib.SlackSendMessage(podTracker.Spec.Reporter.Key, podTracker.Spec.Reporter.Channel, <span class="hljs-string">"Pod "</span>+podTracker.Spec.Name+<span class="hljs-string">" was created"</span>)

    <span class="hljs-keyword">return</span> ctrl.Result{}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>I check whether the to-be-reconciled PodTracker exists; if it does, I send a Slack message using a helper function I made, then return.</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">SlackSendMessage</span><span class="hljs-params">(key <span class="hljs-keyword">string</span>, channelID <span class="hljs-keyword">string</span>, message <span class="hljs-keyword">string</span>)</span></span> {
    api := slack.New(key)
    _, _, err := api.PostMessage(channelID, slack.MsgOptionText(message, <span class="hljs-literal">false</span>))
    <span class="hljs-keyword">if</span> err != <span class="hljs-literal">nil</span> {
        fmt.Println(err)
    }
}
<span class="hljs-comment">// just a simple function that sends a message to a channel</span>
</code></pre>
<p>You provide the slack credentials in the custom resource manifest we did earlier. To get these credentials make sure you create a new slack app and follow the instructions here:</p>
<ol>
<li><p>Install the go library for slack <code>slack-go/slack</code></p>
</li>
<li><p>Go to the Slack API Apps page.</p>
</li>
<li><p>Create a new app from scratch</p>
</li>
<li><p>Navigate to <strong>"OAuth &amp; Permissions"</strong> in your app's settings.</p>
</li>
<li><p>Add the necessary <strong>OAuth scopes</strong> for your app based on what it needs to do. In our case <code>chat:write</code></p>
</li>
<li><p>Go to <strong>"Install App"</strong> under the slack settings.</p>
</li>
<li><p>Click <strong>"Install App to Workspace"</strong>.</p>
</li>
<li><p>Authorize the app with your workspace.</p>
</li>
<li><p>After installation, you’ll see an <strong>OAuth token</strong> in the "OAuth &amp; Permissions" section.</p>
<ul>
<li><p>The token starts with <code>xoxb-</code> (for bot tokens) or <code>xoxp-</code> (for user tokens).</p>
</li>
<li><p>This will be your key.</p>
</li>
</ul>
</li>
<li><p>To get the channel id just check channel details in your slack app for the channel you want to write to it should be at the very bottom of channel details.</p>
</li>
</ol>
<p>If you got all these steps right, execute <code>make install</code> again to compile and install the updated resources, and <code>make run</code> to test the controller logic.</p>
<p>If we try to create an nginx pod using <code>kubectl run nginx --image=nginx</code></p>
<p>we should get a slack notification 🎉</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732280710917/96fec197-0ae5-40b1-9f4c-f5e670e8e098.png" alt class="image--center mx-auto" /></p>
<p>However, our current code has a problem: if a new PodTracker is created, the reconcile method runs and sends a Slack message for the PodTracker resource creation itself. We only want to track created pods, so this behavior is unwanted.</p>
<p>To solve this we can use annotations! That’s where their power comes in. We can annotate PodTracker objects that actually need reconciliation because of a pod creation, and not because the PodTracker itself was created. We can update the <code>HandlePodEvents</code> method as follows:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(r *PodTrackerReconciler)</span> <span class="hljs-title">HandlePodEvents</span><span class="hljs-params">(ctx context.Context, o client.Object)</span> []<span class="hljs-title">ctrl</span>.<span class="hljs-title">Request</span></span> {
    pod, ok := o.(*corev1.Pod)
    <span class="hljs-keyword">if</span> !ok {
        <span class="hljs-keyword">return</span> []ctrl.Request{}
    }

    <span class="hljs-keyword">if</span> pod.Namespace != <span class="hljs-string">"default"</span> {
        <span class="hljs-keyword">return</span> []ctrl.Request{}
    }

    <span class="hljs-comment">// get the list of PodTracker objects</span>
    podTrackerList := &amp;podtrackerv1.PodTrackerList{}
    <span class="hljs-keyword">if</span> err := r.List(ctx, podTrackerList); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> []ctrl.Request{}
    }

    ctrlRequests := []ctrl.Request{}
    <span class="hljs-comment">// iterate over the list of PodTracker objects</span>
    <span class="hljs-keyword">for</span> _, podTracker := <span class="hljs-keyword">range</span> podTrackerList.Items {
        <span class="hljs-comment">// check if the PodTracker object is watching the pod</span>
        <span class="hljs-keyword">if</span> podTracker.Spec.Name == pod.Name {
            <span class="hljs-keyword">if</span> podTracker.Annotations == <span class="hljs-literal">nil</span> {
                podTracker.Annotations = <span class="hljs-keyword">map</span>[<span class="hljs-keyword">string</span>]<span class="hljs-keyword">string</span>{}
            }
        <span class="hljs-comment">// add annotation to check for in the reconcilation</span>
            podTracker.Annotations[<span class="hljs-string">"triggered-by"</span>] = <span class="hljs-string">"pod"</span>
        <span class="hljs-comment">// update the kubectl cluster podtracker object with the new annotation</span>
            <span class="hljs-keyword">if</span> err := r.Update(ctx, &amp;podTracker); err != <span class="hljs-literal">nil</span> {
                log.FromContext(ctx).Error(err, <span class="hljs-string">"Failed to update PodTracker with annotations"</span>)
                <span class="hljs-keyword">continue</span>
            }
            ctrlRequests = <span class="hljs-built_in">append</span>(ctrlRequests, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(&amp;podTracker)})
        }
    }

    <span class="hljs-keyword">return</span> ctrlRequests
}
</code></pre>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-params">(r *PodTrackerReconciler)</span> <span class="hljs-title">Reconcile</span><span class="hljs-params">(ctx context.Context, req ctrl.Request)</span> <span class="hljs-params">(ctrl.Result, error)</span></span> {
    fmt.Println(<span class="hljs-string">"Reconciling PodTracker"</span>)
    podTracker := &amp;podtrackerv1.PodTracker{}
    <span class="hljs-keyword">if</span> err := r.Get(ctx, req.NamespacedName, podTracker); err != <span class="hljs-literal">nil</span> {
        <span class="hljs-keyword">return</span> ctrl.Result{}, client.IgnoreNotFound(err)
    }
    <span class="hljs-comment">// check the annotations if triggered by exists only send a message.</span>
    <span class="hljs-keyword">if</span> podTracker.Annotations != <span class="hljs-literal">nil</span> &amp;&amp; podTracker.Annotations[<span class="hljs-string">"triggered-by"</span>] == <span class="hljs-string">"pod"</span> {
    <span class="hljs-comment">// delete the annotation for cleanup</span>
        <span class="hljs-built_in">delete</span>(podTracker.Annotations, <span class="hljs-string">"triggered-by"</span>)
        <span class="hljs-keyword">if</span> err := r.Update(ctx, podTracker); err != <span class="hljs-literal">nil</span> {
            <span class="hljs-keyword">return</span> ctrl.Result{}, fmt.Errorf(<span class="hljs-string">"failed to clear annotation: %w"</span>, err)
        }
        lib.SlackSendMessage(podTracker.Spec.Reporter.Key, podTracker.Spec.Reporter.Channel, <span class="hljs-string">"Pod "</span>+podTracker.Spec.Name+<span class="hljs-string">" was created"</span>)
    }

    <span class="hljs-keyword">return</span> ctrl.Result{}, <span class="hljs-literal">nil</span>
}
</code></pre>
<p>And voila! now we only send slack messages of newly created pods.</p>
<h1 id="heading-summary">Summary</h1>
<p>The main goal of an article like this is that it’s targeted at backend developers with little to no knowledge about operators. Because let’s be honest, it’s something we might finish our careers without ever touching. It’s just an attempt from me to ease the understanding of concepts that I personally struggled with. The goal was, without diving too deep, to build a simple use case that clearly explains the idea. Hopefully I delivered what I wanted. Also, I might turn this into a YouTube series instead; if anyone wants that, let me know! Till the next one.</p>
]]></content:encoded></item><item><title><![CDATA[Software Architecture - The Hard Parts [Chapter 10] Distributed Data Access]]></title><description><![CDATA[Introduction
In this chapter we’re going to be diving into the different ways services can read data they do not own, in monolithic systems using a single database, developers don’t give a second thought to reading database tables but when data is br...]]></description><link>https://hewi.blog/software-architecture-the-hard-parts-chapter-10-distributed-data-access</link><guid isPermaLink="true">https://hewi.blog/software-architecture-the-hard-parts-chapter-10-distributed-data-access</guid><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[AI]]></category><category><![CDATA[wasm]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sun, 10 Nov 2024 10:12:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731233522278/b26217a0-5344-4808-8444-f862ef47b85e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>In this chapter we’re going to be diving into the different ways services can read data they <strong>do not own</strong>. In monolithic systems using a single database, developers don’t give a second thought to reading database tables, but when data is broken into separate databases owned by distinct services, data access for read operations becomes complex.</p>
<h1 id="heading-data-access-patterns">Data Access Patterns</h1>
<p>Below are some of the most commonly used data access patterns so that a service can access data it doesn’t own.</p>
<h2 id="heading-interservice-communication-pattern">Interservice Communication Pattern</h2>
<p>This is by far the most common pattern for accessing data, if one service needs data it doesn’t have direct access to it simply asks the owning service for it by using some sort of remote access protocol.</p>
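<p>As a concrete illustration, the calling service simply makes a remote call to the owning service. Here’s a minimal Go sketch (the service name and endpoint are hypothetical); the trade-offs are summarized in the table below:</p>
<pre><code class="lang-go">import (
    "encoding/json"
    "net/http"
    "time"
)

type Customer struct {
    ID   string `json:"id"`
    Name string `json:"name"`
}

// fetchCustomer asks the (hypothetical) customer-service for data this
// service doesn't own. Note the coupling: if customer-service is down
// or slow, this call fails or stalls along with it.
func fetchCustomer(id string) (*Customer, error) {
    client := &amp;http.Client{Timeout: 2 * time.Second} // network latency sits on the request path
    resp, err := client.Get("http://customer-service/customers/" + id)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var c Customer
    if err := json.NewDecoder(resp.Body).Decode(&amp;c); err != nil {
        return nil, err
    }
    return &amp;c, nil
}
</code></pre>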
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Simplicity</td><td>Slower performance due to latency, especially if a user’s request is dependent on having this style of communication in the business request. Latencies mainly include <mark>network, security and data latencies</mark></td></tr>
<tr>
<td>No data volume issues (direct service calls)</td><td>Services are tightly coupled together, one service must rely on the other service being available so that it can fulfill its needs. The absence of the service that has the data directly impacts the calling service. Also they must scale together since they are tightly coupled to meet high demand.</td></tr>
</tbody>
</table>
</div><h2 id="heading-column-schema-replication-pattern">Column Schema Replication Pattern</h2>
<p>Here, columns are replicated across tables, thereby replicating the data and making it available to other bounded contexts.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>The service requiring read access has immediate access to the data, which increases performance, fault tolerance, and scalability.</td><td>Data synchronization and consistency: both copies of the columns must be kept in sync, typically through asynchronous communication.</td></tr>
<tr>
<td>Very useful in some data aggregation and reporting scenarios.</td><td>Very hard to govern data ownership: since the data is replicated, other services can update it even though they don’t officially own it, which can cause <strong>data consistency issues</strong>.</td></tr>
</tbody>
</table>
</div><h2 id="heading-replicated-caching-pattern">Replicated Caching Pattern</h2>
<p>While caching is a well-known pattern for increasing performance and responsiveness, it can also be an effective tool for distributed data access and sharing: by leveraging in-memory caching, data needed by other services is made available to each service without them having to ask for it.</p>
<p>There exist different models of caching between services. The basic one is <mark>in-memory caching</mark>, where each service has its own cache separate from other services. This isn’t really useful for sharing data between services because of the lack of synchronization between them.</p>
<p>Another caching model is <mark>distributed caching</mark>, where data is held externally in a caching server and services make requests to that server to retrieve or update the shared cache. However, it’s not that useful for data access for the following reasons:</p>
<ol>
<li><p>No different from the interservice communication pattern (the caller is still tightly coupled, just to a caching server instead of a service)</p>
</li>
<li><p>Different services can update the data, breaking the bounded context regarding data ownership, which can cause inconsistencies between the caches and the owning database.</p>
</li>
<li><p>Latency issues, since the cache is accessed through network calls as described earlier.</p>
</li>
</ol>
<p>Another model is <mark>replicated caching</mark>, where each service has its own in-memory data that is kept in sync between the services, allowing the same data to be shared across multiple services. Any update made to the cache is asynchronously propagated to the caches in the other services.</p>
<p>Of the caching models mentioned, replicated caching is the most suitable for addressing distributed data access.</p>
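<p>As a conceptual illustration (not tied to any particular caching product), the owning service broadcasts its changes over a message bus and every replica applies them locally, so reads never leave the process. The <code>bus</code> object with <code>publish</code>/<code>subscribe</code> methods is an assumption:</p>
<pre><code class="lang-python">import json

class ReplicatedCache:
    """Minimal sketch of a replicated cache: each service holds an
    in-memory replica; updates are propagated asynchronously."""

    def __init__(self, bus, topic="cache-updates"):
        self.data = {}   # this service's local in-memory replica
        self.bus = bus   # hypothetical pub/sub client
        self.topic = topic
        bus.subscribe(topic, self.apply_update)

    def put(self, key, value):
        # Only the cache-owning service should call this; the other
        # services receive the change through the bus instead.
        self.data[key] = value
        self.bus.publish(self.topic, json.dumps({"key": key, "value": value}))

    def apply_update(self, message):
        # Called asynchronously on every replica when an update arrives.
        update = json.loads(message)
        self.data[update["key"]] = update["value"]

    def get(self, key):
        return self.data.get(key)  # no network call on the read path
</code></pre>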
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Services have their own in-memory replica, so they no longer need to make external calls to other services for data.</td><td>Service dependency with regard to the cache data and startup timing: the service holding the replicated cache must start after the service owning the cache. If the owning service is unavailable, the other service has to wait until the cache gets filled. This is only a startup problem, though.</td></tr>
<tr>
<td>Updates made by the cache-owning service are reflected in all other services containing the replica cache.</td><td>If the volume of data is too high, the feasibility of this pattern diminishes quickly. Also, every service instance has its own replicated cache, so if 5 instances are required, that’s the cache size multiplied by 5. Careful analysis must take place here so the caches don’t hog all of the memory resources.</td></tr>
<tr>
<td>Greatly responsive, fault tolerant and scalable</td><td>Very hard to keep the caches fully in sync if the rate of change of the data is too high. The pattern is more suited for relatively static data (data that doesn’t change that often)</td></tr>
<tr>
<td>Can scale independently.</td><td>Configuration and setup management: configuring this replication mechanism across services is not that straightforward.</td></tr>
</tbody>
</table>
</div><h2 id="heading-data-domain-pattern">Data Domain Pattern</h2>
<p>In a previous chapter, a way to resolve joint ownership was to have both services share ownership of the database. The same pattern can be used for data access too.</p>
<p>The solution is to create a data domain, combining multiple tables into a shared schema accessible to both services needing the data, which makes for a broader bounded context.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Services completely decoupled from each other</td><td>Sharing data is generally discouraged in distributed architectures</td></tr>
<tr>
<td>Very high data consistency and integrity (no need for replication, synchronization, etc.)</td><td>Since multiple services have direct access to the schema, any schema change directly impacts these services, and they have to change accordingly.</td></tr>
<tr>
<td>No additional contracts needed to transfer data between services.</td><td>Potential security pitfalls concerning data access, since every service has complete access to all the data in that domain.</td></tr>
</tbody>
</table>
</div><h1 id="heading-summary">Summary</h1>
<p>In this chapter we went through some of the popular data access patterns, where one service basically needs data another service owns. As for the trade-offs of each one and the answer to “which one should I pick”, it’s a big “it depends” as always 🤣</p>
<p>In the next chapter we’re going to be talking about some famous distributed architecture sagas! Stay tuned</p>
]]></content:encoded></item><item><title><![CDATA[Software Architecture - The Hard Parts [Chapter 9] Data Ownership and Distributed Transactions [Part 2]]]></title><description><![CDATA[Hey folks! This is the second part to the previous article where we discussed data ownership and which data belongs to which service. In this part we’ll dive into distributed transactions and talk specifics. If you didn’t read the first part make sur...]]></description><link>https://hewi.blog/software-architecture-the-hard-parts-chapter-9-data-ownership-and-distributed-transactions-part-2</link><guid isPermaLink="true">https://hewi.blog/software-architecture-the-hard-parts-chapter-9-data-ownership-and-distributed-transactions-part-2</guid><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 02 Nov 2024 13:39:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1730554756504/918e3b25-5e0d-449f-86b4-874b706e6958.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey folks! This is the second part to the previous article where we discussed data ownership and which data belongs to which service. In this part we’ll dive into distributed transactions and talk specifics. If you didn’t read the first part make sure you do so <a target="_blank" href="https://hashnode.com/post/cm2qgwjoq000009l5fpmxcs72">here</a>. Let’s dive in.</p>
<h1 id="heading-introduction">Introduction</h1>
<p>When architects think about transactions, they usually think about a single atomic unit of work where multiple database updates are either committed together or all rolled back when an error occurs. ACID is an acronym describing the basic properties of an atomic single unit of work database transaction (atomicity, consistency, isolation and durability)</p>
<p>Let’s first briefly talk about ACID before moving on to distributed transactions.</p>
<h1 id="heading-acid-properties">ACID Properties</h1>
<p><strong>Atomicity</strong> means a transaction must either commit or rollback all of its updates in a single unit of work. All updates are treated as a collective whole so all changes either get committed or rolled back as one unit.</p>
<p><strong>Consistency</strong> means that during the course of the transaction the database would never be left in an inconsistent state or violate any integrity constraints specified in the database.</p>
<p><strong>Isolation</strong> refers to the degree of which individual transactions interact with each other. It protects uncommitted transaction data from being visible to other transactions during the course of the business request.</p>
<p><strong>Durability</strong> means that once a successful response from a transaction commit occurs, it is guaranteed that all the data updates are permanent regardless of further system failures.</p>
<p>ACID can exist within the context of each service in a distributed architecture, but only if the corresponding database supports ACID properties: each service can perform its own commits and rollbacks on the tables it owns within the scope of the atomic business transaction. <strong>However</strong>, if the business request spans multiple services, the entire business request cannot be an ACID transaction; rather, it becomes a <mark>distributed transaction.</mark></p>
<p><strong>Distributed transactions</strong> occur when an atomic business request containing multiple database updates is performed by separately deployed remote services.</p>
<p><strong>Distributed transactions</strong> DO NOT SUPPORT THE ACID PROPERTIES</p>
<p><strong>Atomicity</strong> is not supported because each service commits its own data and performs only one part of the overall atomic business request. So atomicity is bound to the service not the entire request.</p>
<p><strong>Consistency</strong> is not supported because a failure in one service causes the data to be out of sync between the tables responsible for the business request.</p>
<p><strong>Isolation</strong> is not supported because each service commits its part of the data independently; once committed, that data is visible to other requests even though the overall business transaction is still in progress.</p>
<p><strong>Durability</strong> is not supported because, as mentioned before, it applies per service: there are multiple databases, and anything could go wrong in any of them. It is supported for each individual service, though.</p>
<h1 id="heading-distributed-transactions">Distributed Transactions</h1>
<p>Instead of ACID, distributed transactions support something called <strong>BASE</strong>.</p>
<p>Completely opposite to ACID, BASE stands for:</p>
<ol>
<li><p>Basic availability</p>
</li>
<li><p>Soft State</p>
</li>
<li><p>Eventual Consistency</p>
</li>
</ol>
<p><strong>Basic availability</strong> means all the involved services are expected to be available when a distributed transaction is pending.</p>
<p><strong>Soft State</strong> describes a situation where a distributed transaction is in progress and the state of the atomic business request is not yet completed (or in some cases not even known): some services have committed their part while others haven’t, or there is a wait until we get an acknowledgment that everything has worked (or not).</p>
<p><strong>Eventual consistency</strong> means that given enough time, all parts of the distributed transaction will complete successfully and all of the data is in sync with one another.</p>
<p>Moving on we’ll now talk about the patterns involved in eventual consistency and the caveats of each.</p>
<h2 id="heading-eventual-consistency">Eventual Consistency</h2>
<p>Distributed architectures rely heavily on eventual consistency as a trade-off for better operational characteristics such as performance, scalability, elasticity, fault tolerance and availability. There are numerous ways to achieve eventual consistency, but there are three main patterns in use today:</p>
<ol>
<li><p>Background synchronization pattern</p>
</li>
<li><p>Orchestrated request-based pattern</p>
</li>
<li><p>Event based pattern</p>
</li>
</ol>
<p>Let’s dive into them.</p>
<h3 id="heading-background-synchronization-pattern">Background synchronization pattern</h3>
<p>The background synchronization pattern uses a separate or external service or process to periodically check the data sources and keep them in sync with one another. The time for data to become eventually consistent depends on the nature of the background synchronization process, whether it is a batch job running at night, a job running every hour, etc.</p>
<p>This pattern has the <strong>longest length of time</strong> for data to become consistent. However, in some cases the data doesn’t need to be in sync right away.</p>
<p>One of the challenges of this pattern is that the process must know what data has changed, which can be determined in different ways, such as querying the source tables, a database trigger, or an event stream. Most importantly, it must have knowledge of all tables and data sources involved in the transaction.</p>
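<p>For illustration, such a background process often boils down to something like the sketch below, which assumes SQLite-style connections and an <code>updated_at</code> column on the source table for change detection:</p>
<pre><code class="lang-python">import time
from datetime import datetime, timezone

def sync_once(source_db, target_db, last_run_iso):
    # Pull rows that changed since the previous run. Note the coupling:
    # this process needs read and write access to databases it doesn't own.
    changed = source_db.execute(
        "SELECT id, data, updated_at FROM products WHERE updated_at > ?",
        (last_run_iso,),
    ).fetchall()
    for row in changed:
        target_db.execute(
            "INSERT OR REPLACE INTO products (id, data, updated_at) VALUES (?, ?, ?)",
            row,
        )
    target_db.commit()

def run_forever(source_db, target_db, interval_seconds=3600):
    last_run = "1970-01-01T00:00:00+00:00"  # sync everything on the first pass
    while True:
        now = datetime.now(timezone.utc).isoformat()
        sync_once(source_db, target_db, last_run)
        last_run = now
        time.sleep(interval_seconds)  # data only converges once per interval
</code></pre>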
<p>As efficient as this pattern is, it has some serious tradeoffs:</p>
<ol>
<li><p>All of the data sources are coupled together, breaking every bounded-context rule between the data and the services. The background job needs access to the different databases to achieve the eventual synchronization of the distributed transaction, and it must have write access, meaning the background process shares ownership of the tables with the services that own them.</p>
</li>
<li><p>Might lead to duplicate business logic, because what the background job does might already be implemented in the services responsible for each table.</p>
</li>
</ol>
<p>This pattern isn’t suitable for distributed architectures requiring tight bounded contexts (microservices), where the coupling between data ownership and functionality is a critical part of the architecture.</p>
<h3 id="heading-orchestrated-request-based-pattern">Orchestrated request-based pattern</h3>
<p>A common approach for managing distributed transactions is to make sure all of the data sources are in sync during the course of the business request (while the end user is waiting).</p>
<p>This pattern attempts to process the entire business transaction during the course of the business request. Therefore requiring some sort of orchestrator to manage the distributed transaction.</p>
<p>The orchestrator is responsible for managing all of the work needed to process the request, including knowledge of the business process, knowledge of the participants involved, multicasting logic, error handling and contract ownership.</p>
<p>One of the common ways to implement this is to designate a primary service to manage the distributed transaction. Although this approach avoids the need for a separate orchestration service, it tends to overload the designated service: in addition to the role of orchestrator, the service must perform its own responsibilities as well. This approach also leads to tight coupling and synchronous dependencies between services.</p>
<p>Using a dedicated orchestration service for the business request is a better approach here.</p>
<p>As efficient as this pattern is, it has some serious tradeoffs:</p>
<ol>
<li><p>Favors consistency over overall responsiveness.</p>
</li>
<li><p>Really complex error handling: if one service fails, you have to reverse what the transaction performed on the others (a compensating transaction, see the sketch after this list)</p>
</li>
<li><p>Failures might occur even during compensation, which leaves data out of sync and needing human intervention to repair.</p>
</li>
</ol>
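<p>To make the compensation idea concrete, here is a minimal orchestrator sketch; each step pairs an action with its compensating action, and the service clients in the usage comment are hypothetical:</p>
<pre><code class="lang-python">def run_distributed_transaction(steps):
    """Each step is a (do, undo) pair of callables. On failure,
    already-completed steps are compensated in reverse order."""
    completed = []
    for do, undo in steps:
        try:
            do()
            completed.append(undo)
        except Exception:
            # Compensation itself can fail, which is exactly the
            # "data out of sync, human intervention" tradeoff above.
            for compensate in reversed(completed):
                try:
                    compensate()
                except Exception:
                    alert_operator()  # hypothetical escalation hook
            raise

# Hypothetical usage with three participating services:
# run_distributed_transaction([
#     (payment.charge, payment.refund),
#     (inventory.reserve, inventory.release),
#     (shipping.schedule, shipping.cancel),
# ])
</code></pre>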
<h3 id="heading-event-based-pattern">Event based pattern</h3>
<p>This pattern is one of the most popular and reliable eventual consistency patterns for modern distributed architectures. Events are used in conjunction with an asynchronous publish-and-subscribe messaging model to post events or command messages to a topic or event stream; the services involved in the transaction listen for these events and respond to them.</p>
<p>The eventual consistency time is usually short because of the parallel and decoupled nature of asynchronous message processing. Services are highly decoupled from one another, and responsiveness is good because the service triggering the eventual consistency doesn’t have to wait for the data synchronization to finish before returning a response to the customer.</p>
<p>The main tradeoff here is that failure handling becomes complex: what happens if a consumer fails while processing? Most brokers will try a number of times to deliver a message, and after repeated failures they will send it to a dead letter queue, from which it is either automatically repaired or requires human intervention. A sketch of this retry-then-dead-letter behavior follows.</p>
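<p>Roughly, the broker-side behavior looks like this sketch (the attempt limit and the dead letter queue object are assumptions; real brokers make both configurable):</p>
<pre><code class="lang-python">MAX_ATTEMPTS = 5  # assumed; brokers typically make this configurable

def deliver(message, handler, dead_letter_queue):
    """Retry a failing consumer a few times, then park the message
    where automation or a human can deal with it."""
    last_error = None
    for _attempt in range(MAX_ATTEMPTS):
        try:
            handler(message)
            return  # processed successfully; this service is now in sync
        except Exception as error:
            last_error = error
    # Repeated failures: hand the message to the dead letter queue
    # for automated repair or human intervention.
    dead_letter_queue.put((message, str(last_error)))
</code></pre>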
<h1 id="heading-summary">Summary</h1>
<p>In distributed systems, traditional ACID transactions don’t work due to multiple services each handling their own data, leading to distributed transactions that rely on the BASE model: Basic Availability, Soft State, and Eventual Consistency. To handle eventual consistency, three main patterns exist:</p>
<ol>
<li><p><strong>Background Synchronization</strong>: Runs periodic jobs to sync data across services but can cause delays and duplicate business logic.</p>
</li>
<li><p><strong>Orchestrated Request-Based Pattern</strong>: Uses a central orchestrator to ensure all data is consistent during a request, favoring consistency at the cost of complex error handling.</p>
</li>
<li><p><strong>Event-Based Pattern</strong>: Services respond to asynchronous events, allowing for quick, decoupled syncing. Failures may result in dead letter queues, needing human intervention at times.</p>
</li>
</ol>
<p>Each pattern has its tradeoffs in terms of consistency, speed, and complexity.</p>
<p>That’s it for this chapter and watch out for the next one where we’ll be talking about <strong>distributed data access.</strong> Hope you enjoyed!</p>
]]></content:encoded></item><item><title><![CDATA[Software Architecture - The Hard Parts [Chapter 9] Data Ownership and Distributed Transactions [Part 1]]]></title><description><![CDATA[Introduction
In this part, we’ll go through the changes that happen mainly to data once a monolithic system has been pulled apart into separate services each with its own domain. Every service abides by the bounded context rule in Domain Driven Desig...]]></description><link>https://hewi.blog/software-architecture-the-hard-parts-chapter-9-data-ownership-and-distributed-transactions-part-1</link><guid isPermaLink="true">https://hewi.blog/software-architecture-the-hard-parts-chapter-9-data-ownership-and-distributed-transactions-part-1</guid><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 26 Oct 2024 18:01:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729965649670/96c35467-030d-49ce-be07-f5acb0b6d5dc.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>In this part, we’ll go through the changes that happen mainly to data once a monolithic system has been pulled apart into separate services each with its own domain. Every service abides by the bounded context rule in Domain Driven Design where each domain has its own application code and data together.</p>
<p>I’ve written several articles about this book discussing its different chapters. I’d recommend checking them out to make sure you have a complete understanding of what’s going on. In a nutshell, we’re pulling apart a huge monolithic application into several coarse-grained services, and pulling apart the data as well.</p>
<p>When data is pulled apart, it needs to be stitched back together to make the system work. The main hiccups are figuring out which service owns what data, how to manage distributed transactions, and how a service can access data it needs but doesn’t own.</p>
<h1 id="heading-assigning-data-ownership">Assigning Data Ownership</h1>
<p>The main question that arises here is: which service owns which data?</p>
<p>A general rule of thumb for assigning table ownership is that services that perform write operations to a table <strong>own</strong> that table. This works well if a single service writes to the table but it gets messy when multiple services have to write to the same table.</p>
<p>A simple example is as follows;</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729946353642/2ef1967b-7017-4d84-8a3b-59fe91a37780.png" alt="Example of services and data intercommunication" class="image--center mx-auto" /></p>
<p>We have 3 services and 3 databases. As we can see, all 3 services write to the <strong>Audit Table</strong>.</p>
<p>The catalog service writes to the product table, and so does the inventory service.</p>
<p>These overlaps make assigning data ownership a very complex task.</p>
<p>There are 3 common scenarios encountered when assigning data ownership to services:</p>
<ol>
<li><p>Single ownership</p>
</li>
<li><p>Common ownership</p>
</li>
<li><p>Joint ownership</p>
</li>
</ol>
<p>We’re going to dive into them and explore several techniques for resolving these scenarios</p>
<h2 id="heading-single-ownership">Single Ownership</h2>
<p>Occurs when only one service writes to a table. Very straightforward and very easy to resolve. In the example above, the wishlist service is the only one that writes to the wishlist table, making it a single ownership scenario.</p>
<p>Knowing this, we can conclude that the wishlist table is part of the bounded context of the wishlist service.</p>
<blockquote>
<p><em>It’s a lot easier to address single table relationships first when approaching these kinds of problems before moving on to complex scenarios</em></p>
</blockquote>
<h2 id="heading-common-ownership">Common Ownership</h2>
<p>Occurs when most or all of the services need to write to the same table. Looking back at our example, we see all 3 services writing to the audit table. Since all 3 write to it, it’s really difficult to say which service owns this table domain-wise.</p>
<p>A proposed solution is to put the audit table in a shared database or schema since it receives writes from everywhere. However this has its own set of problems:</p>
<ol>
<li><p>Change control: if the schema changes, you’ll have to revisit all the services writing to it and update them accordingly</p>
</li>
<li><p>Connection starvation due to the number of services connecting to it</p>
</li>
<li><p>Scalability and fault tolerance: if the shared database goes down, it causes chaos, with a lot of services encountering issues related to auditing</p>
</li>
</ol>
<p>A popular technique for addressing this is to create a <strong>dedicated auditing service that owns the audit table.</strong></p>
<p>Any service that needs to write audits goes through the audit service instead of accessing the database directly. This has a huge impact on the design and brings a lot of benefits:</p>
<ol>
<li><p>If no acknowledgment is required, a buffer (queue) can sit between the services and the audit service, letting it process audits at its own pace (see the sketch after this list)</p>
</li>
<li><p>The design also becomes more fault tolerant, and it’s easier to scale the audit service independently.</p>
</li>
</ol>
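<p>Here is a minimal sketch of that queue-based design, using Python’s in-process <code>queue</code> module as a stand-in for a real message broker and a hypothetical <code>audit_db</code> client:</p>
<pre><code class="lang-python">import json
import queue

audit_queue = queue.Queue()  # stand-in for a real message broker

def record_audit(service_name, action):
    # Any service drops an audit event on the queue and moves on;
    # no acknowledgment from the audit service is awaited.
    audit_queue.put(json.dumps({"service": service_name, "action": action}))

def audit_worker(audit_db):
    # The audit service owns the audit table and drains the queue
    # at its own pace, decoupled from the producers.
    while True:
        event = json.loads(audit_queue.get())
        audit_db.insert("audit", event)  # hypothetical database client
        audit_queue.task_done()

# e.g. run the worker inside the audit service:
# threading.Thread(target=audit_worker, args=(audit_db,), daemon=True).start()
</code></pre>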
<p>Applying what was said, our design should look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729947199142/36c49bf7-a068-46a5-9a00-26c6cf7921d3.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-joint-ownership">Joint Ownership</h2>
<p>Occurs just like common ownership, except that only a <strong>couple</strong> of services within the same domain write to the same table, not most of them as previously described for common ownership.</p>
<p>Looking back at our first example, only the Catalog and Inventory services perform write operations on the product table.</p>
<p>There exists several techniques to solve this ownership problem</p>
<ol>
<li><p>Table split</p>
</li>
<li><p>Data Domain</p>
</li>
<li><p>Delegation</p>
</li>
<li><p>Service Consolidation</p>
</li>
</ol>
<p>Let’s discuss them one by one.</p>
<h3 id="heading-table-split-technique">Table Split Technique</h3>
<p>The table split technique breaks the table into multiple tables, where each service owns the part of the data it’s responsible for.</p>
<p>Looking at our example: if we can break the product table into 2 tables, where the inventory service owns the data it manipulates and the catalog service owns the rest, we’ve applied the table split technique. This highly depends on the nature of what the inventory service writes; if it only updates counts, for example, we can extract that column into its own table with a product id foreign key as a reference, so the inventory service has its own table.</p>
<p>This moves the joint ownership to single table ownership. The overhead, however, is the ongoing communication between both services to ensure the data is synced correctly and remains in a consistent state.</p>
<p>If a new product is added, for example, the catalog service needs to communicate that to the inventory service, sending it the id and inventory counts; if a product is removed, vice versa.</p>
<p>But a lot of questions arise when syncing data between 2 tables:</p>
<ol>
<li><p>Should the communication be synchronous or asynchronous between both services?</p>
</li>
<li><p>What happens if the catalog service wants to communicate with the inventory service and finds that it’s not available? It’s an availability versus consistency question</p>
</li>
</ol>
<p>Choosing availability means that the catalog service must always be able to add or remove products, regardless of whether the inventory service is up or not.</p>
<p>Choosing consistency means that adding or removing fails if either of the two services is down.</p>
<p>So it depends on the business requirements; knowing what you need is what lets you make the decision.</p>
<h3 id="heading-delegate-technique">Delegate Technique</h3>
<p>In this method, one service is assigned single ownership of the table and becomes the delegate. Any other service communicates with the delegate to perform updates on its behalf.</p>
<p>The main challenge here is knowing which service to assign as the delegate (the sole owner of the table). We have two options:</p>
<ol>
<li><p><strong>Primary domain priority</strong></p>
 <p> Where we assign the table to the service that most closely represents the primary domain of the data (the service that does most of the CRUD operations for an entity in that domain)</p>
</li>
<li><p><strong>Operational characteristics</strong></p>
<p> Assigning the table to the service needing higher operational characteristics such as performance, scalability, availability and throughput</p>
</li>
</ol>
<p>If we look at the following joint ownership scenario</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729948285626/d7fccfb0-5ceb-4ea8-95ff-1531c3a72e08.png" alt class="image--center mx-auto" /></p>
<p>The catalog service performs most of the CRUD operations on the product table, because it creates, updates and removes products and retrieves product information, while the inventory service is responsible for retrieving and updating the inventory count, as well as knowing when to restock if the inventory count is too low.</p>
<p>Applying the <strong>primary domain priority</strong> technique results in the following:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729948415825/744a56b3-71e0-45de-9beb-b58519080793.png" alt class="image--center mx-auto" /></p>
<p>The catalog service would be assigned as the single owner of the table. The Inventory service must communicate with the catalog service to access that table.</p>
<p>Delegate techniques always force interservice communication, requiring the services to talk to each other to update data. The type of communication is key here: with <strong>synchronous communication</strong>, the inventory service must wait for the inventory to be updated by the catalog service, which impacts performance but ensures data consistency. Using <strong>asynchronous communication</strong> boosts performance but makes the data eventually consistent.</p>
<p>With the <strong>operational characteristics priority</strong> option, the ownership would be reversed, because inventory updates occur at a much faster rate than changes to static product data. In this case, ownership would be assigned to the inventory service.</p>
<p>With this option, updates to the inventory can use direct database calls instead of remote access protocols, making inventory operations much faster and keeping the most volatile data (the inventory counts) consistent.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729963895951/54645067-7069-4cbe-9335-7627fd586474.png" alt class="image--center mx-auto" /></p>
<p>However, one major problem here is the domain management responsibility: the inventory service is responsible for managing inventory counts, not for creating, updating and deleting products (and potentially the associated error management too).</p>
<h3 id="heading-service-consolidation-technique">Service Consolidation Technique</h3>
<p>The delegate approach highlights the primary issue with joint ownership: <strong>service dependency</strong>.</p>
<p>The service consolidation technique solves this by combining multiple table owners into a single consolidated service.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729964543607/138c4ac9-de21-4e35-b5e0-94a0b0397cc5.png" alt class="image--center mx-auto" /></p>
<p>Combining multiple services into one creates a <strong>coarse-grained service</strong>, which increases the testing scope as well as the deployment risk (breaking something else in the service when a new feature is added or a bug is fixed). Consolidation might also affect the overall fault tolerance of the system, since the whole service fails together.</p>
<p>Another caveat is that both parts now have to scale together, even if it isn’t necessary for one of them to do so.</p>
<h1 id="heading-summary"><strong>Summary</strong></h1>
<p>This part covered different data ownership scenarios and techniques used to choose which service owns which data.</p>
<p>Summarizing everything we said, the design should look like this now:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729965196417/e5154018-7254-47ed-8b99-3cf5aee30485.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p>We used <strong>single table ownership</strong> for the wishlist service</p>
</li>
<li><p>For <strong>common ownership</strong>, we created an audit service, with all other services sending it messages via a queue (asynchronously).</p>
</li>
<li><p>Finally, for the <strong>joint ownership</strong> between the catalog and inventory services over the product table, we chose the <strong>delegate technique with domain priority</strong>, assigning the table to the catalog service, with the inventory service sending update requests to it.</p>
</li>
</ol>
<p>That’s it for part 1. In the next part we’ll go through distributed transactions and the caveats that happen when data is pulled apart. Stay tuned!</p>
]]></content:encoded></item><item><title><![CDATA[A Bird's-Eye View of Amazon Aurora's Amazing Architecture]]></title><description><![CDATA[In this article i’m going to be simply explaining the architecture Amazon’s well known relational database service Aurora; dive deep into why some decisions were made and the impact they had. I’m going to be abstracting a lot of information just so y...]]></description><link>https://hewi.blog/a-birds-eye-view-of-amazon-auroras-amazing-architecture</link><guid isPermaLink="true">https://hewi.blog/a-birds-eye-view-of-amazon-auroras-amazing-architecture</guid><category><![CDATA[AWS]]></category><category><![CDATA[aurora]]></category><category><![CDATA[Relational Database]]></category><category><![CDATA[Databases]]></category><category><![CDATA[MySQL]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 05 Oct 2024 11:55:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1728129310915/0907ccb0-1b1b-4534-a561-fd4b7981b392.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article i’m going to be simply explaining the architecture Amazon’s well known relational database service Aurora; dive deep into why some decisions were made and the impact they had. I’m going to be abstracting a lot of information just so you can get the general idea. I’ll try to simplify it as much as possible so you don’t have to have deep knowledge to be able to grasp the concept 😊. Enjoy</p>
<h1 id="heading-introduction">Introduction</h1>
<p>Amazon Aurora is a relational database service for OLTP (Online Transactional Processing) workloads offered as part of Amazon Web Services (AWS).</p>
<p>To be able to grasp the reasoning behind Aurora let’s take it back to what a normal relational database does.</p>
<p>In a traditional relational database system, <strong>each database server</strong> (the machine running the database) does all the work:</p>
<ol>
<li><p><strong>Processing Queries</strong>: It handles reading and writing data.</p>
</li>
<li><p><strong>Storing Data</strong>: It saves data locally or on disk.</p>
</li>
<li><p><strong>Handling Failures</strong>: If a crash happens, the same server needs to recover data, replay logs, and bring everything back online.</p>
</li>
</ol>
<p>This setup means each server is responsible for <strong>both compute (processing) and storage</strong>, which can create bottlenecks, especially in high-throughput systems. Network traffic is high because servers need to constantly synchronize data between each other to avoid data loss.</p>
<p>The <strong>I/O bottleneck</strong> usually happens because a single server has to handle both <strong>compute</strong>(processing) and <strong>storage</strong> (reading/writing data to disk). This can overwhelm the server's disk, causing performance slowdowns—especially when there's heavy load. But in a cloud environment like AWS Aurora, things work a bit differently.</p>
<h1 id="heading-auroras-architecture">Aurora’s Architecture</h1>
<h2 id="heading-the-skeleton">The Skeleton</h2>
<p>Aurora separates the <strong>storage service</strong> from the <strong>database instances</strong>. This storage service manages functions like <strong>redo logging, crash recovery, and backups</strong> independently, rather than being tightly integrated with each database instance like in traditional systems.</p>
<p>So, in other words, it split the processing from the storage completely, resulting in processing servers without the overhead of storing the data.</p>
<p>Instead of relying on a single disk or server, Aurora <strong>spreads out</strong> storage across many servers (called the "storage fleet"). This means that no single disk or server is overloaded, as the storage load is <strong>distributed</strong>. However, this introduces a new bottleneck: the <strong>network</strong>.</p>
<p>The database needs to send <strong>requests over the network</strong> to the storage servers to read or write data.</p>
<p>Even though the data is spread across many servers, the database must communicate with several of them at once, creating a lot of <strong>network traffic</strong>.</p>
<p>Also, since the database sends <strong>multiple write requests in parallel</strong> to different storage nodes, if one of those storage nodes or the network path to it is slow, it can cause <strong>delays</strong>. This means the overall speed of the database can be limited by the <strong>slowest node</strong> or network path, even if the others are performing well.</p>
<p>In simpler terms: by spreading the work across many storage servers, the disks aren’t the problem anymore, but now the speed of the <strong>network</strong> between the database and those servers becomes the main thing that can slow things down. Even one <strong>slow server</strong> in the storage fleet can affect the overall speed.</p>
<p>Now the question is: <strong>how did they optimize the network problem mentioned above?</strong></p>
<h2 id="heading-design-choices">Design choices</h2>
<h3 id="heading-reducing-network-traffic-with-redo-logs"><strong>Reducing Network Traffic with Redo Logs</strong></h3>
<ul>
<li><p><strong>Traditional Problem</strong>: both <strong>compute</strong> and <strong>storage</strong> typically reside on the same server. Large chunks of data, such as full data pages, are written to disk during transactions, generating significant <strong>I/O load</strong>. As databases grow and scale, or in clustered environments, this can lead to performance bottlenecks. When <strong>compute and storage are separated</strong>, such as in a distributed cloud system like Aurora, these writes would require <strong>network communication</strong> between the database tier and the storage tier, further amplifying traffic and introducing latency.</p>
</li>
<li><p><strong>Aurora’s Solution</strong>: Aurora <strong>only sends redo logs</strong> (small records that track changes made to the database) to the storage layer, rather than full data pages. These logs are much smaller in size and require less network bandwidth, <strong>drastically reducing network I/O</strong>. This design reduces the overall data that needs to be transmitted over the network by an order of magnitude.</p>
</li>
</ul>
<h3 id="heading-parallel-writes-to-distributed-storage"><strong>Parallel Writes to Distributed Storage</strong></h3>
<ul>
<li><p><strong>Traditional Problem</strong>: In a traditional database setup, all writes would go to a single storage device, creating a bottleneck. Even with distributed systems, data replication to multiple nodes increases network load and complexity.</p>
</li>
<li><p><strong>Aurora’s Solution</strong>: Aurora writes the redo log <strong>in parallel to multiple storage nodes</strong> across multiple availability zones (AZs). This ensures that the system is resilient to node failures and improves performance by distributing the work. Instead of a single node handling all writes, they are spread across many nodes. Additionally, by <strong>splitting the I/O operations</strong> across a fleet of storage servers, it prevents overloading any single server or network link (a quorum-write sketch follows this list).</p>
</li>
</ul>
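<p>As a rough sketch of this quorum-style parallel write path (the Aurora paper uses six copies across three AZs with a four-out-of-six write quorum; the <code>node.append</code> call is a hypothetical storage-node API):</p>
<pre><code class="lang-python">from concurrent.futures import ThreadPoolExecutor, as_completed

WRITE_QUORUM = 4  # per the paper: a write is durable once 4 of 6 nodes ack

def replicate_log_record(record, storage_nodes):
    """Send a redo log record to all storage nodes in parallel and
    return as soon as a quorum acknowledges, so one slow node or
    network path doesn't stall the write."""
    pool = ThreadPoolExecutor(max_workers=len(storage_nodes))
    futures = [pool.submit(node.append, record) for node in storage_nodes]
    acks = 0
    for future in as_completed(futures):
        if future.exception() is None:
            acks += 1
            if acks >= WRITE_QUORUM:
                pool.shutdown(wait=False)  # stragglers finish in the background
                return True
    pool.shutdown(wait=False)
    return False  # quorum not reached; treat the write as failed
</code></pre>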
<h3 id="heading-asynchronous-background-operations"><strong>Asynchronous Background Operations</strong></h3>
<ul>
<li><p><strong>Traditional Problem</strong>: Operations like <strong>backups</strong> and <strong>crash recovery</strong> are usually <strong>synchronous</strong> and happen in real-time, which can spike network traffic and lead to bottlenecks.</p>
</li>
<li><p><strong>Aurora’s Solution</strong>: Aurora offloads complex tasks like <strong>backup</strong> and <strong>redo recovery</strong> to the distributed storage fleet, where they are performed <strong>continuously in the background</strong> and <strong>asynchronously</strong>. This means the database doesn’t have to pause to perform these tasks, and they don’t generate massive network loads all at once. Instead, traffic is <strong>spread out over time</strong> and across nodes.</p>
</li>
</ul>
<h3 id="heading-fault-tolerance-and-self-healing-mechanism"><strong>Fault Tolerance and Self-Healing Mechanism</strong></h3>
<ul>
<li><p><strong>Traditional Problem</strong>: If a <strong>single node or network path</strong> slows down or fails, it can cause significant performance degradation. In a split architecture, the failure of a storage node or network path can delay the entire system.</p>
</li>
<li><p><strong>Aurora’s Solution</strong>: Aurora’s storage layer is <strong>fault-tolerant</strong> and <strong>self-healing</strong>. If a storage node, disk, or network path becomes slow or fails, the system automatically reroutes traffic to healthy nodes. This reduces the impact of <strong>outliers</strong> (i.e., slow nodes or links), ensuring that performance issues at one storage node don’t bottleneck the entire system.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1728128910872/316cdb5a-3e31-408e-b9d7-f0b7c219cd58.png" alt class="image--center mx-auto" /></p>
<p>The image above is a bird’s-eye view of Aurora’s architecture. Aurora uses the AWS RDS control plane, and the database engine is a fork of “community” MySQL/InnoDB that diverges primarily in how InnoDB reads and writes data to disk, mainly in the redo-log part as mentioned above. Backups are stored on AWS S3 blob storage.</p>
<p>This was a quick blog post summarizing Aurora’s refreshing take on relational database architecture. The main goal of this article was to give a high-level understanding of the differences, highlighting the problems that appear at high scale and the reasoning behind these design choices. Every choice exists because of a problem. Thank you for tuning in, and till the next one :)</p>
<h1 id="heading-references">References</h1>
<ol>
<li><a target="_blank" href="https://web.stanford.edu/class/cs245/readings/aurora.pdf">https://web.stanford.edu/class/cs245/readings/aurora.pdf</a></li>
</ol>
]]></content:encoded></item><item><title><![CDATA[White Paper Summaries | Apache Flink]]></title><description><![CDATA[Hello folks! In this summary we're going to be talking about Apache Flink. We're going to dive into what it is, what problems does it aim to solve and a few deep dives here and there. Let's start
Introduction
Apache Flink is an open-source system for...]]></description><link>https://hewi.blog/white-paper-summaries-apache-flink</link><guid isPermaLink="true">https://hewi.blog/white-paper-summaries-apache-flink</guid><category><![CDATA[apache]]></category><category><![CDATA[apache-flink]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Batch Processing]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Thu, 22 Aug 2024 12:00:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724327975148/4cc9ceaf-0b67-475c-8d44-6343123eb69b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello folks! In this summary we're going to be talking about Apache Flink. We're going to dive into what it is, what problems it aims to solve, and a few deep dives here and there. Let's start</p>
<h1 id="heading-introduction">Introduction</h1>
<p>Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph analysis) can be expressed and executed as pipelined fault-tolerant dataflows.</p>
<p>There exist mainly two types of data processing:</p>
<ol>
<li><p>Data stream processing (real time)</p>
</li>
<li><p>Batch processing (static)</p>
</li>
</ol>
<p>Both have their use cases depending on the business model. However, recently there has been an increase in the processing of real-time data, whether it be logs, changes to the application state, readings, etc.</p>
<p>However, most streams aren't actually treated as streams; they're processed in batches (statically), where a batch might cover a specific time period, for example. Data collection tools, workflow managers, and schedulers orchestrate the creation and processing of batches. These approaches suffer from high latency (imposed by the batches), high complexity (connecting and orchestrating several systems, and implementing business logic twice), as well as arbitrary inaccuracy, since the time dimension is not explicitly handled by the application code.</p>
<p>Apache Flink follows a paradigm that embraces data-stream processing as the unifying model for <strong>real-time analysis, continuous streams, and batch processing</strong> both in the programming model and in the execution engine. Flink supports different notions of time (event-time, ingestion-time, processing-time) in order to give programmers high flexibility in defining how events should be correlated.</p>
<p>Batch programs are special cases of streaming programs, where the stream is finite, and the order and time of records does not matter (all records implicitly belong to one all-encompassing window). However, to support batch use cases with competitive ease and performance, Flink has a specialized API for processing static data sets, uses specialized data structures and algorithms for the batch versions of operators like join or grouping, and uses dedicated scheduling strategies. The result is that Flink presents itself as a full-fledged and efficient batch processor on top of a streaming runtime.</p>
<h1 id="heading-system-architecture">System Architecture</h1>
<p>Now that we have a good overview of what Flink does, let's talk about its architecture.</p>
<p>Flink consists of four main layers;</p>
<ol>
<li><p>Deployment</p>
</li>
<li><p>Core</p>
</li>
<li><p>API</p>
</li>
<li><p>Libraries</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724323199740/81d310e5-ae48-47f5-892c-2382d3b9330b.png" alt class="image--center mx-auto" /></p>
<p>The core of Flink is the distributed dataflow engine, which executes dataflow programs.</p>
<p>A Flink runtime program is a DAG of stateful operators connected with data streams.</p>
<blockquote>
<p><strong>Directed Acyclic Graph (DAG):</strong></p>
<ul>
<li><p>The program is represented as a DAG, where each node is a computation (e.g., a function, a transformation) and each edge represents the flow of data between these nodes.</p>
</li>
<li><p>The edges indicate the direction of data flow, from data sources through transformations to outputs.</p>
</li>
</ul>
</blockquote>
<p>There are two core APIs in Flink:</p>
<ol>
<li><p>The DataSet API for processing finite data sets (often referred to as <em>batch processing</em>)</p>
</li>
<li><p>The DataStream API for processing potentially unbounded data streams (often referred to as <em>stream processing</em>).</p>
</li>
</ol>
<p>Flink’s core runtime engine can be seen as a streaming dataflow engine, and both the DataSet and DataStream APIs create runtime programs executable by the engine.</p>
<p>As such, it serves as the common fabric to abstract both bounded (batch) and unbounded (stream) processing.</p>
<p>Flink bundles domain-specific libraries and APIs that generate DataSet and DataStream API programs, currently, FlinkML for machine learning, Gelly for graph processing and Table for SQL-like operations.</p>
<h2 id="heading-flink-cluster-architecture">Flink Cluster Architecture</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724323561320/6a0d5113-7e1b-415d-8eb4-94103474b1db.png" alt class="image--center mx-auto" /></p>
<p>A Flink cluster comprises three types of processes:</p>
<ol>
<li><p>Client</p>
</li>
<li><p>Job Manager</p>
</li>
<li><p>At least one Task Manager</p>
</li>
</ol>
<p><strong>The client</strong> takes the program code, transforms it to a dataflow graph, and submits that to the JobManager. This transformation phase also examines the data types (schema) of the data exchanged between operators and creates serializers and other type/schema specific code.</p>
<p><strong>DataSet programs (batch)</strong> additionally go through a cost-based query optimization phase, similar to the physical optimizations performed by relational query optimizers.</p>
<p><strong>The</strong> <strong>JobManager</strong> coordinates the distributed execution of the dataflow, It tracks the state and progress of each operator and stream, schedules new operators, and coordinates checkpoints and recovery.</p>
<p>In a high-availability setup, the JobManager persists a minimal set of metadata at each checkpoint to a fault-tolerant storage, such that a standby JobManager can reconstruct the checkpoint and recover the dataflow execution from there.</p>
<p><strong>The actual data processing takes place in the TaskManagers</strong>. A TaskManager executes one or more operators that produce streams, and reports on their status to the JobManager. The TaskManagers maintain the buffer pools to buffer or materialize the streams, and the network connections to exchange the data streams between operators.</p>
<blockquote>
<p>An operator is a node in the DAG mentioned above, it's a processing step that the stream goes into.</p>
</blockquote>
<h1 id="heading-streaming-dataflows">Streaming Dataflows</h1>
<p>Although users can write Flink programs using a multitude of APIs, all Flink programs eventually compile down to a common representation: the dataflow graph.</p>
<p>The dataflow graph is executed by Flink’s runtime engine, the common layer underneath both the batch processing (DataSet) and stream processing (DataStream) APIs.</p>
<h2 id="heading-dataflow-graph">Dataflow Graph</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724323987642/a442e856-4eb1-4d76-a7f7-645e97b07207.png" alt class="image--center mx-auto" /></p>
<p>The dataflow graph as depicted in Figure 3 is a directed acyclic graph (DAG) that consists of the following:</p>
<ol>
<li><p>Stateful Operators</p>
</li>
<li><p>Data streams that represent data produced by an operator and are available for consumption by operators.</p>
</li>
</ol>
<p>Dataflow graphs are executed in a data-parallel fashion: the same operation is applied to different partitions of the dataset at the same time across multiple computing resources (e.g., CPUs, or machines in a cluster).</p>
<p>Instead of processing one piece of data after another (sequential processing), the system processes many pieces of data in parallel.</p>
<p>Operators are parallelized into one or more parallel instances called <em>subtasks</em>, and streams are split into one or more <em>stream partitions</em> (one partition per producing subtask). The stateful operators (which may be stateless as a special case) implement all of the processing logic (e.g., filters, hash joins and stream window functions).</p>
<p>Streams distribute data between producing and consuming operators in various patterns, such as point-to-point, broadcast, re-partition, fan-out, and merge.</p>
<p>So the main idea is to split the data between producers and consumers, parallelizing it over the operators.</p>
<h2 id="heading-data-exchange-through-intermediate-data-streams"><strong>Data Exchange through Intermediate Data Streams</strong></h2>
<p><mark>Flink’s intermediate data streams are the core abstraction for data-exchange between operators. An intermediate data stream represents a logical handle to the data that is produced by an operator and can be consumed by one or more operators.</mark></p>
<p><strong><mark>Intermediate streams</mark></strong> <mark>are logical in the sense that the data they point to may or may not be materialized on disk.</mark></p>
<p><strong>Pipelined Streams</strong>: These are used in Apache Flink to allow different parts of a dataflow (producers and consumers) to run at the same time. Data is sent from one operator to the next without waiting for the entire dataset to be processed first. This allows for faster, real-time processing.</p>
<p>If a downstream operator (consumer) is slow, it can slow down the upstream operator (producer), creating "backpressure." Flink manages short-term fluctuations in data flow using buffers.</p>
<p><strong>Blocking Streams</strong>: These are used when you need to fully process and store data from one operator before moving on to the next.</p>
<p>The producing operator finishes its work and stores all its output before the consuming operator starts processing. This separates the two operators into distinct stages.</p>
<p>Since all data is stored before being passed on, blocking streams use more memory and may write data to disk if needed.</p>
<p>There’s no backpressure since the next stage only starts after the current stage is fully complete.</p>
<p>Blocking streams are useful when you need to isolate operators (like in complex operations such as sorting) to prevent issues like distributed deadlocks in the system.</p>
<hr />
<p>When <strong>Flink</strong> processes data, it splits data into chunks called buffers before sending them from one operator (producer) to another (consumer).</p>
<ul>
<li><p>A buffer can be sent as soon as it’s full, or</p>
</li>
<li><p>It can be sent after a certain amount of time, even if it’s not full. (timeout)</p>
</li>
</ul>
<p>Here comes the tradeoff between latency and throughput;</p>
<p><strong>Latency</strong>: How quickly data is processed and moved through the system.</p>
<p><strong>Throughput</strong>: How much data the system can handle in a given time period.</p>
<p><strong>Low Latency</strong>: To achieve low latency (faster response times), Flink sends buffers more quickly, even if they’re not full. This means data moves through the system faster, but the throughput (amount of data processed) might be lower. (or even small buffers)</p>
<p><strong>High Throughput</strong>: To achieve higher throughput (processing more data), Flink waits until buffers are full before sending them. This increases the amount of data processed at once but can slow down the response time, leading to higher latency. (larger buffers also)</p>
<p>Flink allows you to balance between how fast data is processed (latency) and how much data is processed at once (throughput) by adjusting how buffers are handled. Shorter timeouts mean faster data movement but lower throughput, while longer timeouts mean higher throughput but slower data movement.</p>
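<p>Conceptually, the buffering logic is a "flush when full or when the timeout expires" policy. Here is a minimal sketch of the idea (not Flink's actual internals; a real implementation would also flush from a background timer rather than only when a new record arrives):</p>
<pre><code class="lang-python">import time

class SendBuffer:
    """Sketch of the latency/throughput knob: a small capacity or short
    timeout favors latency; a large capacity favors throughput."""

    def __init__(self, send, capacity=100, timeout_seconds=0.1):
        self.send = send                # downstream transport (assumed callable)
        self.capacity = capacity        # bigger batches favor throughput
        self.timeout = timeout_seconds  # shorter timeouts favor latency
        self.records = []
        self.last_flush = time.monotonic()

    def add(self, record):
        self.records.append(record)
        full = len(self.records) >= self.capacity
        expired = time.monotonic() - self.last_flush >= self.timeout
        if full or expired:
            self.flush()

    def flush(self):
        if self.records:
            self.send(self.records)  # one network hand-off for the whole batch
        self.records = []
        self.last_flush = time.monotonic()
</code></pre>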
<hr />
<p>Apart from exchanging data, streams in Flink communicate different types of <strong>control events</strong>. These are <strong>special events</strong> injected in the data stream by operators, and are delivered in-order along with all other data records and events within a stream partition. The receiving operators react to these events by performing certain actions upon their arrival. Examples are;</p>
<ol>
<li><p>Checkpoint Barriers; used to create a snapshot of the data processing at a specific point in time.</p>
</li>
<li><p>Watermarks; markers in the data stream that show how far along the system is in processing time-based events.</p>
</li>
<li><p>Iteration Barriers; used in specialized algorithms that require multiple passes over the data (iterative algorithms).</p>
</li>
</ol>
<p>Streaming dataflows in Flink do not provide ordering guarantees after any form of repartitioning or broadcasting and the responsibility of dealing with out-of-order records is left to the operator implementation.</p>
<h2 id="heading-iterative-dataflow">Iterative Dataflow</h2>
<p>Iterations are important for tasks like graph processing and machine learning, where you often need to repeatedly process data to refine results. In traditional approaches, you either submit a new job for each iteration or add more nodes to the processing graph.</p>
<p>In Flink, iterations are managed by special operators called iteration steps. These steps allow the processing of data to repeat in a controlled manner.</p>
<p>Flink’s iteration steps use <strong>feedback edges to create loops in the data processing pipeline.</strong> This enables data to flow back into the iteration step, allowing for iterative processing.</p>
<p>Flink uses <strong>head and tail tasks (thought as operators)</strong> to manage the flow of data through the iteration steps. These tasks handle the data records that are fed back into the iteration, ensuring that the processing is coordinated.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1724326356221/75a4cca5-5225-45e7-ac2a-cdffc0d8fbdf.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-fault-tolerance">Fault Tolerance</h1>
<p>Flink offers reliable execution with strict exactly-once-processing consistency guarantees and deals with failures via checkpointing and partial re-execution.</p>
<p>The general assumption the system makes to effectively provide these guarantees is that the data sources are <strong>persistent</strong> and <strong>replayable</strong>.</p>
<p>Examples of such sources are files and durable message queues (e.g., Apache Kafka).</p>
<p>As mentioned before, Flink uses a system called checkpointing to make sure that, even if something goes wrong, your data processing continues exactly where it left off without losing or duplicating data.</p>
<p>Data streams can be huge and never-ending, so if you had to start over after a failure, it could take months to catch up. That would be impractical.</p>
<p>To avoid this, Flink regularly saves snapshots of the current state of the data processing, including the exact position in the data stream. If something fails, Flink can quickly recover using these snapshots, so it doesn’t have to reprocess everything from the beginning.</p>
<p>The core challenge they faced when saving snapshots is that all parallel operators (processing units) need to take a snapshot of their state at the same logical time: capturing a consistent view of the entire data processing system without stopping it.</p>
<p>So they introduced something called <strong>Asynchronous Barrier Snapshotting (ABS)</strong>:</p>
<ul>
<li><p>Special markers (called barriers) are inserted into the data streams. These barriers represent a specific point in time.</p>
</li>
<li><p>When a barrier reaches an operator, it marks that operator’s state as part of the current snapshot. Data before the barrier is included in the snapshot, and data after the barrier is not.</p>
</li>
<li><p>This process allows Flink to take snapshots without stopping the entire data processing system, thus keeping the system running smoothly.</p>
</li>
</ul>
<p>Each partition of a stream operates independently and will have its own barriers. When a barrier is inserted into the stream, it travels through each partition separately.</p>
<p>The barriers represent the same logical time across all partitions, but they may not arrive simultaneously at every partition due to differences in processing speed and network delays.</p>
<blockquote>
<p>How it works, in depth:</p>
<ol>
<li><p><strong>Alignment Phase</strong>: Each operator in the data pipeline receives barriers from upstream operators. Before taking a snapshot, the operator makes sure that it has received all barriers from all of its input streams. This ensures that the snapshot reflects a consistent point in time across all inputs.</p>
</li>
<li><p><strong>State Saving</strong>: After confirming all barriers are received, the operator saves its current state (e.g., contents of windows or custom data structures) to durable storage, such as HDFS or another storage system.</p>
</li>
<li><p><strong>Barrier Forwarding</strong>: Once the state is safely backed up, the operator forwards the barrier to the next operators downstream. This continues until all operators have taken their snapshots and forwarded the barriers.</p>
</li>
<li><p><strong>Complete Snapshot</strong>: The snapshot process is complete when all operators have registered their states and forwarded the barriers. The snapshot captures all operator states as they were when the barriers passed through, ensuring a consistent global snapshot (see the conceptual sketch right after this quote).</p>
</li>
</ol>
<p>Recovery Process:</p>
<p><strong>Restoring State</strong>:</p>
<ul>
<li><p><strong>From Snapshots</strong>: When a failure occurs, Flink restores all operator states from the last successful snapshot.</p>
</li>
<li><p><strong>Restarting Streams</strong>: Input streams are restarted from the point of the latest barrier that has a snapshot. This limits the amount of data that needs to be reprocessed to just the records between the last two barriers.</p>
</li>
</ul>
</blockquote>
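<p>Here's a small conceptual sketch of that align, save, forward sequence in Python. The class and its structure are my own simplification for illustration, not Flink's real implementation.</p>
<pre><code class="lang-python">import copy

class Operator:
    """Conceptual ABS participant: aligns barriers, snapshots, forwards."""
    def __init__(self, name, num_inputs):
        self.name = name
        self.num_inputs = num_inputs
        self.state = {}             # e.g. window contents or counters
        self.barriers_seen = set()  # input channels whose barrier arrived

    def on_record(self, record):
        # Records arriving before the barrier are reflected in the snapshot;
        # records arriving after it belong to the next checkpoint.
        self.state[record["key"]] = self.state.get(record["key"], 0) + 1

    def on_barrier(self, channel, snapshot_store, downstream):
        self.barriers_seen.add(channel)
        # Alignment phase: wait until the barrier arrived on every input.
        if len(self.barriers_seen) == self.num_inputs:
            # State saving: back up a consistent copy to durable storage.
            snapshot_store[self.name] = copy.deepcopy(self.state)
            # Barrier forwarding: only now pass the barrier downstream.
            for op in downstream:
                op.on_barrier(self.name, snapshot_store, [])
            self.barriers_seen.clear()

store = {}
sink = Operator("sink", num_inputs=1)
source = Operator("source", num_inputs=1)
source.on_record({"key": "a"})
source.on_barrier("input-0", store, [sink])
print(store)  # {'source': {'a': 1}, 'sink': {}}
</code></pre>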
<p><strong>Benefits of ABS:</strong></p>
<ol>
<li><p>It guarantees exactly-once state updates without ever pausing the computation</p>
</li>
<li><p>The checkpointing mechanism is independent of other control messages in the system, like events triggering window computations. This means it doesn’t interfere with other data processing features.</p>
</li>
<li><p>ABS is not tied to any specific storage system. The state can be backed up to various storage systems depending on the environment, like file systems or databases.</p>
</li>
</ol>
<h1 id="heading-stream-analytics-on-top-of-dataflows"><strong>Stream Analytics on Top of Dataflows</strong></h1>
<p>Flink’s DataStream API is designed for stream processing, handling complex tasks like time management, windowing, and state maintenance. It builds on Flink’s runtime, which already supports efficient data transfers, stateful operations, and fault tolerance. The API allows users to define how data is grouped and processed over time, while the underlying runtime manages these operations efficiently and reliably.</p>
<h2 id="heading-the-notion-of-time"><strong>The Notion of Time</strong></h2>
<p>Flink distinguishes between two notions of time:</p>
<ol>
<li><p>Event-time, which denotes the time when an event originates (e.g., the timestamp associated with a signal arising from a sensor, such as a mobile device)</p>
</li>
<li><p>Processing-time, which is the wall-clock time of the machine that is processing the data.</p>
</li>
</ol>
<p>There can be differences (skew) between event-time and processing-time, leading to potential delays when processing events based on their actual event-time.</p>
<p>Hence they introduce watermarks:</p>
<p>Watermarks are special events used to track the progress of time within a stream processing system. They help the system understand which events have been processed and which are still pending.</p>
<p>A watermark includes a time attribute <code>t</code>, indicating that all events with a timestamp lower than <code>t</code> have been processed.</p>
<p>Watermarks originate from the sources of the data stream and travel through the entire processing topology. As they move, they help maintain a consistent view of time across different operators.</p>
<p>Operators like <code>map</code> or <code>filter</code> just forward the watermarks they receive. Operators that perform calculations based on watermarks (e.g., event-time windows) compute results triggered by the watermark and then forward it. For multiple inputs, the operator forwards the minimum of the incoming watermarks to ensure accurate results.</p>
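<p>The minimum rule is small enough to show in a couple of lines. This is just a sketch of the logic, not Flink code:</p>
<pre><code class="lang-python">def forward_watermark(incoming_watermarks):
    """An operator with several inputs may only advance its event-time clock
    to the minimum of the watermarks it has received; anything higher could
    wrongly declare still-pending events as complete."""
    return min(incoming_watermarks.values())

# Two upstream partitions report different progress:
watermarks = {"input-a": 1_000, "input-b": 850}
print(forward_watermark(watermarks))  # 850: events up to t=850 are complete
</code></pre>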
<p>Flink programs that are based on processing-time rely on local machine clocks, and hence possess a less reliable notion of time, which can lead to inconsistent replays upon recovery. However, they exhibit lower latency. Programs that are based on event-time provide the most reliable semantics, but may exhibit latency due to event-time-processing-time lag. Flink includes a third notion of time as a special case of event-time called <em>ingestion-time</em>, which is the time that events enter Flink. That achieves a lower processing latency than event-time and leads to more accurate results in comparison to processing-time.</p>
<h2 id="heading-stateful-stream-processing"><strong>Stateful Stream Processing</strong></h2>
<p>State is critical to many applications, such as machine-learning model building, graph analysis, user session handling, and window aggregations. There is a plethora of different types of states depending on the use case. For example, the state can be something as simple as a counter or a sum or more complex, such as a classification tree or a large sparse matrix often used in machine-learning applications. Stream windows are stateful operators that assign records to continuously updated buckets kept in memory as part of the operator state.</p>
<h3 id="heading-state-management-in-flink">State Management in Flink</h3>
<ol>
<li><p><strong>Explicit State Handling</strong>:</p>
<ul>
<li><p><strong>State Registration</strong>: Flink allows users to explicitly manage state within their applications. This means users can define and work with state in a clear and controlled way.</p>
</li>
<li><p><strong>Operator Interfaces/Annotations</strong>: Flink provides interfaces or annotations that enable you to register local variables within an operator's scope. This ensures that the state you define is closely associated with the specific operator that needs it.</p>
</li>
</ul>
</li>
<li><p><strong>Operator-State Abstraction</strong>:</p>
<ul>
<li><p><strong>Key-Value States</strong>: Flink offers a high-level abstraction for state management. You can declare state as partitioned key-value pairs, which allows for efficient and flexible management of state within streaming applications.</p>
</li>
<li><p><strong>Associated Operations</strong>: Along with declaring state, Flink provides operations to interact with this state, such as reading, updating, and deleting state entries (a rough sketch follows this list).</p>
</li>
</ul>
</li>
<li><p><strong>State Backend Configurations</strong>:</p>
<ul>
<li><p><strong>StateBackend Abstractions</strong>: Users can configure how state is stored and managed using StateBackend abstractions. This includes specifying the storage mechanism (e.g., file system, database) and how the state is checkpointed.</p>
</li>
<li><p><strong>Custom State Management</strong>: This flexibility allows for custom state management solutions tailored to specific application needs and performance requirements.</p>
</li>
</ul>
</li>
<li><p><strong>Checkpointing and Durability</strong>:</p>
<ul>
<li><strong>Exactly-Once Semantics</strong>: Flink’s checkpointing mechanism ensures that any registered state is durable and maintained with exactly-once update semantics. This means that state changes are reliably recorded and can be accurately recovered in case of failures.</li>
</ul>
</li>
</ol>
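<p>As a rough mental model, partitioned key-value state behaves like a per-key store that the checkpointing mechanism snapshots. The Python class below is hypothetical and only mirrors the shape of interfaces like Flink's <code>ValueState</code>; it is not the real API.</p>
<pre><code class="lang-python">class KeyedValueState:
    """Toy stand-in for partitioned key-value operator state."""
    def __init__(self):
        self._by_key = {}  # key -> value; checkpointed alongside the operator

    def value(self, key, default=None):
        return self._by_key.get(key, default)

    def update(self, key, value):
        self._by_key[key] = value

    def clear(self, key):
        self._by_key.pop(key, None)

# A per-user running count inside a hypothetical keyed operator:
counts = KeyedValueState()
for user in ["alice", "bob", "alice"]:
    counts.update(user, counts.value(user, 0) + 1)
print(counts.value("alice"))  # 2
</code></pre>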
<h3 id="heading-stream-windows"><strong>Stream Windows</strong></h3>
<p>Incremental computations over unbounded streams are often evaluated over continuously evolving logical views, called windows. Apache Flink incorporates windowing within a stateful operator that is configured via a flexible declaration composed of three core functions:</p>
<ol>
<li><p>Window <em>assigner</em></p>
</li>
<li><p><em>Trigger</em> (optional)</p>
</li>
<li><p><em>Evictor</em></p>
</li>
</ol>
<ol>
<li><p><strong>Window assigner:</strong></p>
</li>
</ol>
<p>Assigns each record to one or more logical windows.</p>
<p><strong>Examples</strong>:</p>
<ul>
<li><p><strong>Time Windows</strong>: Based on timestamps (e.g., 6-second windows).</p>
</li>
<li><p><strong>Count Windows</strong>: Based on the number of records (e.g., 1000 records).</p>
</li>
<li><p><strong>Sliding Windows</strong>: Overlapping windows that can cover multiple periods or counts (e.g., a window every 2 seconds).</p>
</li>
</ul>
<ol start="2">
<li><strong>Trigger:</strong></li>
</ol>
<p>Determines when the operation associated with the window is performed.</p>
<p><strong>Examples</strong>:</p>
<ul>
<li><p><strong>Event Time Trigger</strong>: The operation happens when a watermark passes the end of the window.</p>
</li>
<li><p><strong>Count Trigger</strong>: The operation happens after a certain number of records (e.g., every 1000 records).</p>
</li>
</ul>
<ol start="3">
<li><p><strong>Evictor:</strong></p>
<p> Decides which records to keep within each window (a toy sketch of all three functions follows this list).</p>
<p> <strong>Examples</strong>:</p>
<ul>
<li><strong>Count Evictor</strong>: Keeps a fixed number of the most recent records (e.g., the last 100 records).</li>
</ul>
</li>
</ol>
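<p>Here's a toy sketch in Python showing the three functions cooperating on a stream of events. The function names and thresholds are made up for illustration; in Flink these would be assigner, trigger, and evictor objects configured on a windowed stream.</p>
<pre><code class="lang-python">def tumbling_assigner(record, size=6):
    """Window assigner: map an event timestamp to a 6-second window."""
    start = (record["ts"] // size) * size
    return (start, start + size)

def count_trigger(window_records, threshold=3):
    """Trigger: fire once the window holds `threshold` records."""
    return len(window_records) >= threshold

def count_evictor(window_records, keep=2):
    """Evictor: keep only the most recent `keep` records."""
    return window_records[-keep:]

windows = {}  # window -> records (purging after firing omitted for brevity)
for event in [{"ts": 1, "v": 10}, {"ts": 4, "v": 20}, {"ts": 5, "v": 30}]:
    window = tumbling_assigner(event)
    windows.setdefault(window, []).append(event)
    if count_trigger(windows[window]):
        kept = count_evictor(windows[window])
        print(window, sum(e["v"] for e in kept))  # (0, 6) 50
</code></pre>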
<h1 id="heading-batch-analytics-on-top-of-dataflows"><strong>Batch Analytics on Top of Dataflows</strong></h1>
<p><strong>Streaming and Batch Processing</strong>: Flink uses the same runtime engine for both streaming and batch computations. This means that both types of workloads benefit from the same execution infrastructure.</p>
<p><strong>Handling Batch Computations</strong>:</p>
<ul>
<li><p><strong>Blocking Data Streams</strong>: For batch processing, large computations can be broken into isolated stages using blocking data streams. These stages are executed sequentially, which allows for efficient processing and scheduling.</p>
</li>
<li><p><strong>Turning Off Periodic Snapshotting</strong>: When the overhead of periodic snapshotting (used for fault tolerance) is high, it is turned off. Instead, fault recovery is managed by replaying lost data from the most recent materialized intermediate stream, which could be from the source.</p>
</li>
</ul>
<p><strong>Blocking Operators</strong>:</p>
<ul>
<li><p><strong>Definition</strong>: Blocking operators (like sorts) are those that wait until they have consumed their entire input before proceeding. The runtime does not differentiate between blocking and non-blocking operators.</p>
</li>
<li><p><strong>Memory Management</strong>: These operators use managed memory, which can be on or off the JVM heap. If their memory usage exceeds available memory, they can spill data to disk (a sketch of this spilling pattern follows the list).</p>
</li>
</ul>
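<p>As an illustration of a blocking operator that spills, here's a hedged sketch of an external sort in Python: it consumes its whole input, spills sorted runs to disk whenever an (artificially tiny) memory budget is exceeded, and then merges the runs. Flink does this with managed memory segments rather than Python lists, so treat this purely as the shape of the idea.</p>
<pre><code class="lang-python">import heapq
import tempfile

def external_sort(values, max_in_memory=2):
    """Blocking sort: read the entire input, spilling sorted runs to disk."""
    run_files, buffer = [], []

    def spill():
        buffer.sort()
        run = tempfile.TemporaryFile(mode="w+t")
        run.writelines(f"{v}\n" for v in buffer)
        run.seek(0)
        run_files.append(run)
        buffer.clear()

    for v in values:                      # blocking: consume the whole input
        buffer.append(v)
        if len(buffer) >= max_in_memory:  # memory budget exceeded: spill
            spill()
    if buffer:
        spill()
    runs = [map(int, run) for run in run_files]
    return list(heapq.merge(*runs))       # streaming merge of the sorted runs

print(external_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
</code></pre>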
<p><strong>DataSet API</strong>:</p>
<ul>
<li><p><strong>Batch Abstractions</strong>: The DataSet API provides abstractions specifically for batch processing. It includes a bounded DataSet structure and transformations like joins, aggregations, and iterations.</p>
</li>
<li><p><strong>Fault-Tolerance</strong>: DataSets are designed to be fault-tolerant, ensuring reliable processing of batch data.</p>
</li>
</ul>
<p><strong>Query Optimization</strong>:</p>
<ul>
<li><p><strong>Optimization Layer</strong>: Flink includes a query optimization layer that transforms DataSet programs into efficient executable plans. This optimization helps improve performance and resource utilization.</p>
</li>
<li><p>Flink uses advanced techniques to optimize query execution, considering network, disk, and CPU costs, and incorporates user hints for better accuracy.</p>
</li>
</ul>
<p><strong>Memory Management</strong>: Flink improves memory efficiency by serializing data into segments, processing data in binary form, and minimizing garbage collection.</p>
<p><strong>Batch Iterations</strong>: Flink supports various iteration models and optimizes iterative processes with techniques like delta iterations for efficient computation.</p>
<p>This approach enables Flink to effectively manage and optimize both streaming and batch processing tasks, leveraging a unified runtime and specialized APIs for different types of workloads.</p>
<h1 id="heading-summary">Summary</h1>
<p>In this article, we dove deep into Apache Flink, exploring its core functionalities and advanced techniques. Key topics covered include:</p>
<ul>
<li><p><strong>Unified Data Processing</strong>: We examined how Flink’s runtime supports both streaming and batch processing, allowing seamless handling of continuous and bounded data.</p>
</li>
<li><p><strong>Fault Tolerance</strong>: We detailed Flink’s checkpointing mechanism, which ensures exactly-once processing guarantees by capturing consistent snapshots of operator states and stream positions.</p>
</li>
<li><p><strong>State Management</strong>: We explored Flink’s approach to explicit state handling, including state abstractions and custom configurations for flexible state storage and checkpointing.</p>
</li>
<li><p><strong>Windowing</strong>: We discussed Flink’s robust windowing system, which supports a variety of time-based and count-based windows, and handles out-of-order events.</p>
</li>
<li><p><strong>Batch Processing Optimization</strong>: We covered how Flink adapts its runtime for batch processing with techniques like blocking operators and efficient data management.</p>
</li>
<li><p><strong>Query Optimization</strong>: We looked into Flink’s advanced query optimization strategies, including cost-based planning and handling of complex UDF-heavy DAGs.</p>
</li>
<li><p><strong>Memory Management</strong>: We analyzed Flink’s memory management practices, including serialized data handling and off-heap memory usage to reduce garbage collection overhead.</p>
</li>
</ul>
<p>Overall, the article provided an in-depth look at how Flink handles data processing, fault tolerance, state management, and optimizations for both streaming and batch scenarios. Hope you guys enjoyed and till the next one!</p>
<h1 id="heading-references">References</h1>
<ol>
<li><a target="_blank" href="https://asterios.katsifodimos.com/assets/publications/flink-deb.pdf">https://asterios.katsifodimos.com/assets/publications/flink-deb.pdf</a></li>
</ol>
]]></content:encoded></item><item><title><![CDATA[White Paper Summaries | Apache Kafka]]></title><description><![CDATA[Hello everyone! In this white paper summary we're going to tackle a paper written by an engineer that works at LinkedIn who talks us through how Kafka was designed and some of the design choices they made. As usual we'll walk through the paper and hi...]]></description><link>https://hewi.blog/white-paper-summaries-apache-kafka</link><guid isPermaLink="true">https://hewi.blog/white-paper-summaries-apache-kafka</guid><category><![CDATA[kafka]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Amr Elhewy]]></dc:creator><pubDate>Sat, 20 Jul 2024 20:09:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1721506101615/90b26384-3cf0-47b5-88aa-7cc572698e93.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello everyone! In this white paper summary we're going to tackle a paper written by an engineer at LinkedIn, who walks us through how Kafka was designed and some of the design choices they made. As usual, we'll walk through the paper and highlight the important parts. This is really just me reading it and writing out notes, but I do recommend you guys read it! I'll leave the link in the reference section below.</p>
<p>These articles are meant to capture the important bits of the paper, or the parts I was most interested in. I've added and removed some things and kept other parts exactly the same. The end goal is just to understand how it works, but all credit goes to the writer 100%.</p>
<blockquote>
<p>This paper was written in 2011 so it doesn't mention replication because Kafka didn't have it back then.</p>
</blockquote>
<h1 id="heading-introduction">Introduction</h1>
<p>The paper starts by addressing how log processing has become very critical in this day and age, and moves on to say that Kafka was made to tackle some of the log processing problems they faced at LinkedIn and that it took ideas from other messaging systems. Then the writer starts vouching for Kafka and how its performance and scalability are superior compared to other systems.</p>
<p>He then moves on to talk about what "log" data really is in companies (user clicks, metrics, etc.) and how back in the day it was mainly used for analytics, whereas nowadays it feeds directly into production systems in real time (search relevance, recommendations, ad targeting, security protections).</p>
<p>These types of log data are very challenging to handle due to their sheer volume, and processing them in a fast and efficient way is an even harder challenge.</p>
<p>The writer then mentions that the old way of processing these logs was scraping them all for analysis, which is very inefficient if you think about it, and that several log aggregators were built in recent years (Scribe, Flume, etc.) which normally offload the data into HDFS (Hadoop).</p>
<p>LinkedIn wanted to achieve more with log aggregation: they needed to support all the real-time applications mentioned above (search relevance, etc.) with a delay of no more than a few seconds.</p>
<p>Then Kafka comes in, a combination of traditional log aggregators and messaging systems. The most important thing was that it allowed consuming these logs in real time. In the next section we'll discuss the different messaging systems that were available at that time and why LinkedIn couldn't adopt them and had to invent Kafka instead.</p>
<h1 id="heading-related-work">Related Work</h1>
<p>The messaging systems available at that time weren't a good fit for log processing; there was a mismatch in features. They focused more on <strong>delivery guarantees</strong> rather than <strong>throughput</strong>, which was considered overkill for collecting log data, where a single unregistered click wouldn't be the end of the world. The unneeded features increased the complexity of the system, but they existed because not every system has throughput as its primary constraint. Not only that, but those systems were also very weak in distributed support: there was no easy way to partition and store messages on multiple machines.</p>
<p>The writer then talks about how these systems usually aggregate the logs and dump them periodically, and how most of them do this processing offline (not in real time). They also use a "push" model, pushing data to the consumers, which could potentially overload a consumer that is still processing data. LinkedIn found the "pull" model more convenient to work with, since each consumer can consume at its own rate and avoid being flooded by messages.</p>
<p>So to wrap up, here's the summary:</p>
<ol>
<li><p>LinkedIn wanted throughput; none of the existing systems provided that</p>
</li>
<li><p>Existing systems weren't that scalable or real-time</p>
</li>
<li><p>The push model was not going to work for LinkedIn</p>
</li>
</ol>
<p>In the next section we talk about Kafka's architecture and design principles.</p>
<h1 id="heading-kafka-architecture-and-design-principles">Kafka Architecture and Design Principles</h1>
<p>The basic outlines of Kafka are as follows:</p>
<ol>
<li><p>A stream of messages of a particular type is called a <strong>topic</strong></p>
</li>
<li><p>A producer publishes messages to the topic</p>
</li>
<li><p>The messages are stored on a set of servers called <strong>brokers</strong></p>
</li>
<li><p>A consumer subscribes to one or more topics from the broker and <strong>pulls</strong> data from the broker.</p>
</li>
</ol>
<p>In the consumer, each message stream provides an iterator interface over the continual stream of messages being produced. The consumer iterates over the messages and <strong>blocks</strong> if none exist yet; below is a conceptual sketch of such a blocking iterator.</p>
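<p>A minimal sketch of that iterator contract in Python. The queue stands in for the broker; nothing here is Kafka's real client API.</p>
<pre><code class="lang-python">import queue
import threading
import time

def message_stream(q):
    """Iterator interface over a continual stream: yields forever and
    blocks whenever no message is available yet."""
    while True:
        yield q.get()  # blocks until a message arrives

broker = queue.Queue()

def producer():
    for i in range(3):
        time.sleep(0.1)
        broker.put(f"message-{i}")

threading.Thread(target=producer, daemon=True).start()
stream = message_stream(broker)
for _ in range(3):
    print(next(stream))  # blocks between arrivals, then prints message-0..2
</code></pre>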
<p>Kafka supports two delivery models:</p>
<ol>
<li><p>Point-to-point delivery (just a basic queue where only one consumer takes a given message and processes it)</p>
</li>
<li><p>Pub/Sub model where multiple consumers get a copy of the same message.</p>
</li>
</ol>
<p>Below is a simple diagram visualizing what we wrote above (stole it from the paper 😅)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721478926556/18c2123d-06fe-43a7-856b-2b3bf135a5c8.png" alt class="image--center mx-auto" /></p>
<p>A topic is divided into multiple <strong>partitions</strong> and each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time. In the next section we talk about partitions and some of the design choices that were made.</p>
<h2 id="heading-partition-layout-and-design-choices">Partition Layout and Design Choices</h2>
<p>Each partition of a topic has a logical log. Physically the log is a set of segment files where each file is around 1GB.</p>
<p>Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. For better performance, we flush the segment files to disk only after a configurable number of messages have been published or a certain amount of time has elapsed. <strong>A message is only exposed to the consumers after it is flushed.</strong></p>
<blockquote>
<p>This is an example of trading a little bit of (configurable) latency for durability and consistency. When messages come into the broker they sit in an in-memory buffer, and after a configurable amount they get flushed to disk. Only once they are flushed can the consumer see them. This gives a huge boost in durability and consistency at the cost of some milliseconds of latency, which is very much worth it.</p>
</blockquote>
<p>Unlike typical messaging systems, a message stored in Kafka doesn’t have an explicit message id. Instead, each message is addressed by its <strong>logical offset</strong> in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map the message ids to the actual message locations. Note that our message ids are increasing but not consecutive. To compute the id of the next message, we have to add the length of the current message to its id.</p>
<p>A consumer always consumes messages from a particular partition <strong>sequentially</strong>. If the consumer acknowledges a particular message offset, <strong>it implies that the consumer has received all messages prior to that offset in the partition</strong>. Under the hood, the consumer is issuing asynchronous pull requests to the broker to have a buffer of data ready for the application to consume. Each pull request contains the offset of the message from which the consumption begins and an acceptable number of bytes to fetch. <strong>Each broker keeps in memory a sorted list of offsets that include the offset of the first message in every segment file</strong>. The broker locates the segment file where the requested message resides by searching the offset list, and sends the data back to the consumer. <strong>After a consumer receives a message, it computes the offset of the next message to consume and uses it in the next pull request</strong>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721479957458/76755461-77f0-4185-91b8-5ff06c4a9e9d.png" alt class="image--center mx-auto" /></p>
<p>The image above visualizes the in-memory index present on the <strong>broker</strong>.</p>
<p>Each index entry is the first offset of a segment file; with a simple binary search, given an offset we can find the segment file that contains it. Here's a small sketch of both that lookup and the next-offset computation.</p>
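<p>The numbers and names below are illustrative, not Kafka's actual code:</p>
<pre><code class="lang-python">import bisect

# First offset of each ~1GB segment file, kept sorted in memory on the broker.
segment_start_offsets = [0, 1_073_741_824, 2_147_483_648]

def find_segment(offset):
    """Binary search: the segment holding `offset` is the last one whose
    start offset is not greater than `offset`."""
    return bisect.bisect_right(segment_start_offsets, offset) - 1

print(find_segment(1_500_000_000))  # 1 (the second segment file)

# Offsets are logical positions in the log, so the consumer computes the
# next offset by adding the length of the message it just consumed:
def next_offset(current_offset, message_length):
    return current_offset + message_length

print(next_offset(1_500_000_000, 512))  # 1500000512
</code></pre>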
<p>The writer also mentions that although the end consumer API iterates one message at a time, under the covers each pull request from a consumer actually retrieves multiple messages up to a certain size, typically hundreds of kilobytes, so that it has them ready for processing, since they're already flushed to disk on the broker.</p>
<p>One of the very smart decisions the writer talks about is depending on the <strong>file system page cache</strong> instead of <strong>in-memory caching</strong> for accessing recent messages. This has the following advantages:</p>
<ol>
<li><p>Avoids double buffering: there is no in-memory buffer; data lives only in the file system page cache</p>
</li>
<li><p>Completely offloads caching to the OS rather than the Kafka process, which is magnificent because there is no garbage collection overhead</p>
</li>
<li><p>If the Kafka process restarts, the cache still exists, since it's the OS's responsibility; it only goes away if the machine is rebooted.</p>
</li>
</ol>
<p>OS caching plays a huge role too, since producers and consumers access segment files sequentially. It was found that both production and consumption have consistent performance, linear in the data size, up to many terabytes of data.</p>
<p>Kafka also optimized network access for the consumers. Let's have a look at the normal OS process for reading a local file from disk and sending it over the network:</p>
<ol>
<li><p>Read the file from disk into memory (page cache in OS)</p>
</li>
<li><p>Since it's in memory, we need to copy it to the application buffer (still in memory), i.e., to a place in memory where the application can actually access it (the memory reserved for the application)</p>
</li>
<li><p>Then, as transmission is about to begin, the data is copied to a kernel buffer, which interacts with the underlying socket to send it.</p>
</li>
<li><p>The kernel buffer sends over the data via the socket.</p>
</li>
</ol>
<p>That's quite a lot of copies and system calls being made. Kafka optimized this by leveraging an API that exists in Linux systems called <code>sendfile</code>.</p>
<p>This directly transfers bytes from the file to the socket skipping all these copies and boosting performance.</p>
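<p>Python exposes the same idea through <code>socket.sendfile()</code>, which wraps the <code>sendfile</code> syscall on Linux. Here's a minimal sketch; the function name and path handling are hypothetical:</p>
<pre><code class="lang-python">import socket

def serve_segment(conn: socket.socket, path: str) -> None:
    """Zero-copy transfer: bytes move from the page cache straight to the
    socket buffer, never passing through application memory."""
    with open(path, "rb") as segment:
        conn.sendfile(segment)  # uses os.sendfile() under the hood on Linux
</code></pre>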
<p><strong>Stateless broker</strong>: In Kafka, the information about how much each consumer has consumed is not maintained by the broker, but by the consumer itself. Such a design reduces a lot of the complexity and the overhead on the broker. However, this makes it tricky to delete a message, since a broker doesn’t know whether all subscribers have consumed the message.</p>
<p>Kafka solves this problem by using a simple <strong>time-based SLA</strong> for the retention policy. A message is automatically deleted if it has been retained in the broker longer than a certain period, typically 7 days. This solution works well in practice. Most consumers, including the offline ones, finish consuming either daily, hourly, or in real-time. <strong>The fact that the performance of Kafka doesn’t degrade with a larger data size makes this long retention feasible.</strong></p>
<p>There is an important side benefit of this design. A consumer can deliberately <strong><em>rewind</em></strong> to an old offset and re-consume data. This violates the common contract of a queue, but proves to be an essential feature for many consumers. For example, when there is an error in application logic in the consumer, the application can re-play certain messages after the error is fixed. This is particularly important for ETL data loads into our data warehouse or Hadoop system.</p>
<p>As another example, the consumed data may be flushed to a persistent store only periodically (e.g., a full-text indexer). If the consumer crashes, the unflushed data is lost. In this case, the consumer can checkpoint the smallest offset of the un-flushed messages and re-consume from that offset when it's restarted. We note that rewinding a consumer is much easier to support in the pull model than the push model. Next up are some design considerations governing Kafka being distributed in nature.</p>
<h2 id="heading-distributed-coordination">Distributed Coordination</h2>
<p>Producers and consumers both operate in a distributed setting. Each producer can publish a message to either a randomly selected partition or a partition semantically determined by a partitioning key/function.</p>
<p>Kafka has the concept of <strong><em>consumer groups</em>.</strong> Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics, i.e., <strong>each message is delivered to only one of the consumers within the group.</strong></p>
<p>Different consumer groups each independently consume the full set of subscribed messages and <strong>no coordination is needed across consumer groups.</strong></p>
<p>The consumers within the same group can be in different processes or on different machines. Our goal is to divide the messages stored in the brokers evenly among the consumers, <strong>without introducing too much coordination overhead.</strong></p>
<p>The writer mentions that the <mark>first decision</mark> was to make a partition within a topic <strong>the smallest unit of parallelism.</strong></p>
<p>This means that at any given time, all messages from one partition are consumed only by a <strong>single consumer</strong> within each consumer group.</p>
<p>Had we allowed multiple consumers to simultaneously consume a single partition, they would have to <strong>coordinate who consumes what messages, which necessitates locking and state maintenance overhead.</strong></p>
<p>In contrast, in our design consuming processes only need to coordinate when the consumers rebalance the load, an infrequent event.</p>
<p>The <mark>second decision</mark> that we made is to not have a central “master” node, but instead let consumers coordinate among themselves in a decentralized fashion.</p>
<p>Adding a master can complicate the system since we have to further worry about master failures.</p>
<p>To facilitate the coordination, we employ a highly available consensus service, <strong>Zookeeper</strong>. If you're unsure about what Zookeeper is, I recommend you check out an article I wrote about it <a target="_blank" href="https://hewi.blog/navigating-the-jungle-of-distributed-systems-a-guide-to-zookeeper-and-leader-election-algorithms">here</a></p>
<p>Kafka uses Zookeeper for the following:</p>
<ol>
<li><p>Detecting the addition and the removal of brokers and consumers.</p>
</li>
<li><p>Triggering a rebalance process in each consumer when the above happens.</p>
</li>
<li><p>Maintaining the consumption relationship and keeping track of the consumed offset of each partition.</p>
</li>
</ol>
<p>When each broker or consumer starts up, it stores its information in a broker or consumer registry in Zookeeper. The broker registry contains the broker’s host name and port, and the set of topics and partitions stored on it. The consumer registry includes the consumer group to which a consumer belongs and the set of topics that it subscribes to. Each consumer group is associated with an ownership registry and an offset registry in Zookeeper. The ownership registry has one path for every subscribed partition and the path value is the id of the consumer currently consuming from this partition.</p>
<p>The offset registry stores for each subscribed partition, the offset of the last consumed message in the partition.</p>
<p>The paths created in Zookeeper are ephemeral for the broker registry, the consumer registry and the ownership registry, and persistent for the offset registry.</p>
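<p>Using the third-party <code>kazoo</code> client, the ephemeral-vs-persistent split looks roughly like this. The paths and values are illustrative, not Kafka's exact registry layout, and the sketch assumes a ZooKeeper server running locally:</p>
<pre><code class="lang-python">from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumes a local ZooKeeper
zk.start()

# Ephemeral node: disappears automatically if this consumer's session dies,
# which is what lets the others detect the failure and trigger a rebalance.
zk.create("/consumers/group-1/ids/consumer-42",
          b"topics=clicks", ephemeral=True, makepath=True)

# Persistent node: a committed offset must survive consumer restarts.
zk.ensure_path("/consumers/group-1/offsets/clicks/0")
zk.set("/consumers/group-1/offsets/clicks/0", b"128500")

zk.stop()
</code></pre>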
<p>Once a new consumer is added, or changes are propagated to consumers (broker/consumer changes), that consumer runs a <strong>rebalance process</strong> to determine the subset of partitions it should consume from. I recommend reading the rebalance algorithms directly from the paper, as they're explained best there.</p>
<h2 id="heading-delivery-guarantees">Delivery Guarantees</h2>
<p>Kafka guarantees <strong>at-least-once delivery</strong>; exactly-once delivery can be very complex to achieve (it requires something like two-phase commit). Most of the time, though, a message is delivered exactly once to each consumer group.</p>
<p>In the case when a consumer process crashes without a clean shutdown, the consumer process that takes over the partitions owned by the failed consumer may get some duplicate messages that come after the last offset successfully committed to Zookeeper.</p>
<p>If an application cares about duplicates, it must add its own de-duplication logic. This is usually a more cost-effective approach than using two-phase commits.</p>
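<p>A minimal sketch of such de-duplication logic, keyed by the (partition, offset) pair that uniquely identifies a message. The handler names are hypothetical:</p>
<pre><code class="lang-python">def apply_business_logic(message):
    print("processing", message)

processed = set()  # in practice, a bounded or persistent store

def handle(partition, offset, message):
    """At-least-once delivery means replays after a crash, so skip any
    (partition, offset) pair we've already applied."""
    if (partition, offset) in processed:
        return  # duplicate from a replay; skip it
    apply_business_logic(message)
    processed.add((partition, offset))

handle(0, 128, "click-event")
handle(0, 128, "click-event")  # replayed duplicate: processed only once
</code></pre>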
<p>Kafka guarantees that messages from a single partition are delivered to a consumer in order. However, there is no guarantee on the ordering of messages coming from different partitions.</p>
<p>To avoid log corruption, Kafka stores a CRC for each message in the log. If there is any I/O error on the broker, Kafka runs a recovery process to remove those messages with inconsistent CRCs. Having the CRC at the message level also allows us to check network errors after a message is produced or consumed.</p>
<blockquote>
<p>In Apache Kafka, CRC (Cyclic Redundancy Check) is a mechanism used to ensure data integrity. Specifically, Kafka uses CRC to detect errors in the data being transmitted or stored.</p>
<p>Kafka uses CRC checksums in various parts of its architecture to ensure that the data remains intact from the producer to the broker and from the broker to the consumer.</p>
<p>When the data is read or transmitted, the checksum is recalculated and compared with the original checksum. If they don't match, it indicates that the data has been corrupted (a small sketch of this check follows the quote).</p>
</blockquote>
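<p>The check itself is simple. Here's a sketch using Python's built-in <code>zlib.crc32</code>; Kafka's actual on-disk format differs, this only shows the verification idea:</p>
<pre><code class="lang-python">import zlib

def checksum(payload: bytes) -> int:
    return zlib.crc32(payload)

# Producer side: store the CRC alongside the message.
message = b"user-123 clicked ad-456"
stored_crc = checksum(message)

# Broker/consumer side: recompute and compare to detect corruption.
corrupted = message + b"!"
assert checksum(message) == stored_crc        # intact message passes
assert checksum(corrupted) != stored_crc      # corruption is detected
</code></pre>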
<p>In the final section of the paper, the writer talks about the usage of Kafka at LinkedIn. I won't be writing it here, but it's a good read; I'd recommend it 100%.</p>
<h1 id="heading-reference">Reference</h1>
<ul>
<li><a target="_blank" href="https://notes.stephenholiday.com/Kafka.pdf">https://notes.stephenholiday.com/Kafka.pdf</a></li>
</ul>
]]></content:encoded></item></channel></rss>