Malthe's Homepage: Software Engineering and Architecture, Data Processing and Analytics A Blog About Computer Systems

Tue, 13 December 202222:56:00 GMT

Why pre-cloud tools are quietly dying

In Why is everyone trying to kill Airflow, Daniel Beach argues that the demise of popular open-source orchestrator Apache Airflow is exaggerated, "calling no-go on the doomsday proclaimers who are out there peddling the end of Airflow as we know it is near."

He argues that while there are lots of things to be unhappy about (read the article for a long list of those), it's fundamentally a good system, there's a strong open-source community around it and some cloud vendors are even providing the system as a managed service.

The trouble is, Airflow is cut from the same cloth as other pre-cloud era tools (the initial release was in 2015, the same year that Kubernetes came out), being both hard to scale down and hard to scale up. As Daniel Beach mentions, Airflow has inspired a range of new tools that were all built on a cloud foundation (typically Kubernetes).

Scaling down

Scaling down means being able to use the same tool in the small. You might want to use Airflow for a small project, as just another cog in the wheel of an application for its sheer utility, or in a system for continous integration.

In Scaling down is hard to do (2000), Mauri Laitinen describes how this is actually fundamentally a difficult problem in software engineering, especially from the perspective of an organization. Running small workflows often amounts to tacit knowledge that's hard to encode in a standardized way.

In more practical terms, the computing overhead of running Airflow at small scale is significant. Systems that were designed to operate in the cloud such as AWS Step Functions or Azure Data Factory are much easier to run at low scale and consequently, are offered at a low price point.

Of course, those systems are not free software, but there is no reason a free software couldn't operate in the same way. Airflow could probably be reengineered to fit better into this computing framework, but it takes a lot of effort to both move a software package forward, introducing new abstractions, and maintain backwards compatibility.

Scaling up

Scaling up means using the same tool to solve bigger problems, different kinds of problems, or lots of small problems.

One way to solve bigger problems with Airflow is to call upon specialized services to do the heavy lifting, letting Airflow handle just the orchestration aspect. But where is the fun in that? And scaling down is still hard.

Astronomer, a commercial company that's betting big on Airflow and contributing significant resources to its development, is pushing for Airflow to do more, providing for example a Python library to make it easier to run ETL workflows within Airflow. There's a lot of competition in this area however, and perhaps users will choose software that doesn't put Airflow front and center, but merely offers it as one out of many backends. Also, the name "Astro" is used liberally by Astronomer to mean different things, suggesting that they're also not quite sure what this value proposition is all about.

Surviving the cloud

Ultimately, Airflow is no distributed compute engine. Activities in Airflow are just Python programs and there are no abstractions to enable pushdown optimizations, parallel processing or pluggable storage. This simplicity is a huge part of Airflow's success, but it's also what it keeps it from solving bigger problems, or more generic problems.

The way software progresses is to find useful abstractions that help solve a class of problems. Airflow is still lacking in this regard, probably deliberately so, but part of the story is that Airflow is written in Python, a language that wasn't designed to support large programs or powerful abstractions, running on top of an operating platform that doesn't provide much functionality.

Running workloads in a cloud computing environment means renting resources efficiently. If a resource isn't used, turn it off. The operating platform or cloud fabric needs to support this sort of orchestration, or you will have to reinvent it at a higher layer, with a substantial overhead.

Not really able to scale down an individual "instance" of Airflow, there is an ongoing effort to implement multi-tenancy, still in a draft state. This work could definitely extend Airflow's shelf life, allowing it to better scale to enterprise needs without the overhead, but the road is long and there are architectural shortcomings that need to be addressed as well.

Wed, 13 April 202217:16:00 GMT

Automatic HTTPS on Kubernetes

Kubernetes

The ingress controller supported by the Kubernetes project itself is nginx. And while there are recipes for setting up automated issuing of TLS certificates using free CAs such as Let's Encrypt, there are quite a few steps involved and you will need to deploy additional services to your cluster to make it work.

Meanwhile, the ingress controller for Caddy does it fully-automated, out-of-the-box.

Enable it during install using the onDemandTLS option like so:

$ helm install \
    --namespace=caddy-system \
    --repo https://caddyserver.github.io/ingress/ \
    --atomic \
    --set image.tag=v0.1.0 \
    --set ingressController.config.onDemandTLS=true \
    --set ingressController.config.email=<your-email> \
    --set replicaCount=1 \
    --version 1.0.0 \
    main \
    caddy-ingress-controller

The email option is to allow the CA to send expiry notices if your certificate is coming up for renewal. I suppose that doesn't hurt.

HTTPS for a local setup

Sometimes it's nice to point a domain to localhost and have HTTPS working for it nonetheless – for example, when testing out authentication flows.

I use a combination of tools to achieve this:

Hitch TLS proxy to proxy traffic from 443 to 80.
Varnish Cache to proxy traffic from 80 to one or more backends, and possibly rewrite the request.
ACME Shell script in stateless mode.

In the real world, my domain is pointing to the Kubernetes cluster. But since I don't have nginx running as my ingress controller, I need an actual service to reply to the ACME requests that will be sent to <my-domain>/.well-known/acme-challenge/<key>.

Python to the rescue!

I added a deployment to the Kubernetes cluster with an image set to python:slim-bullseye and simply mounted the script below as /scripts/main.py using a configmap.

from os import environ
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = int(environ.get("PORT", 8080))
ACCOUNT_THUMBPRINT = environ["ACCOUNT_THUMBPRINT"]

class handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        challenge = self.path.rsplit("/")[-1]
        message = f"{challenge}.{ACCOUNT_THUMBPRINT}"
        self.wfile.write(bytes(message, "ascii"))

with HTTPServer(("", PORT), handler) as server:
    server.serve_forever()

The deployment is set to run python /scripts/main.py. The account thumbprint is a secret key that you get when you register a session with the ACME shell script.

Kind of complicated – but at least now I can issue a TLS certificate for my domain any time using:

$ acme.sh --issue -d example.com --stateless

The setup would be a little smoother if I had published a ready-to-go container image with the script included.

Sun, 27 February 202210:48:00 GMT

Container Registry on a Budget using AWS S3

AWS

Inspired by Mike Cartmell's Budget Kubernetes Hosting for Personal Use blog series, I wanted to set up my own Kubernetes cluster on DigitalOcean on a budget (although I have since moved to Scaleway).

There was one component Mike wasn't able to skimp on – the container registry.

DigitalOcean charges $5/month for a basic plan of their container registry product which has a storage of 5 GiB. Not impressive!

Their AWS S3-compatible Spaces product provides 250 GiB of object storage for the same money – and for most regions, traffic from object storage to pods is free.

A static container registry

This got me thinking, what actually is a container registry in the context of Kubernetes? Isn't it just some static file repository with image manifest and layer data – that is, why aren't people using a simple object store to host their containers rather than a complex service?

I'm not the first person to get that idea, but things have changed in the past few years on the container scene and I ran into some obsolete technology trying out existing solutions.

I run containerd on my local system and use nerdctl as a (more or less) drop-in replacement for Docker. Some of the tools used in these other solutions such as skopeo just aren't compatible with this setup. It was actually a turn for the better, because nerdctl makes the process a lot easier.

Exporting an image

Unlike Docker's save command, nerdctl save exports images in the Docker image manifest V2, schema 2 format – essentially an OCI image index.

Exporting an image basically looks like this (assuming you've already pulled down the image):

$ nerdctl save gcr.io/google-samples/node-hello:1.0 > image.tar

Uploading an image to object storage

The file structure in a container registry isn't directly compatible with the image index format – but it's quite close.

I prepared a bash script that automates the uploading process. The code is available on Github.

The script sets an ACL (access control list) of public-read. If you know the name of the image and tag, then you can pull down the image. That's great if you're building open-source software, but sometimes you're not.

Keeping images safe from prying eyes

AWS S3 (and perhaps surprisingly, DigitalOcean Spaces) provides a quite flexible means of restricting access to object storage called bucket policies. DigitalOcean doesn't really advertise this functionality much and hardly document its usage except in their forums.

But it's quite straight-forward to add a policy that restricts access to our container registry (which has been uploaded to the /v2/ path in the object storage as per the container registry protocol).

On AWS S3, you can attach a VPC endpoint for Amazon S3 and use a condition on aws:sourceVpce to limit access to the container registry on a network level.

{
    "Statement": [
        {
            "Sid": "Allow access to container registry.",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<bucket-name>/v2/*",
            "Condition": {
                "StringNotEquals": {
                    "aws:SourceVpce": [
                        "vpce-1111111",
                        "vpce-2222222"
                    ]
                }
            }
        }
    ]
}

Not so on DigitalOcean!

We can limit on aws:SourceIp but this is sometimes awkward or impossible in which case we may instead use a clever workaround whereby we require some unguessable user agent header value (such as "secret123") and deny any request that does not have this:

Either way, set the policy using:

$ aws s3api put-bucket-policy --bucket <name> --policy "file://policy.json"

(For DigitalOcean Spaces, you need to configure the correct endpoint here per usual.)

Serving using Kubernetes

If you're using the user agent condition you will need a reverse proxy since there is no built-in mechanism when pulling a container image to provide such a custom header.

Reverse proxying functionality is included in the standard NGINX ingress controller; one way to enable it is through a server snippet annotation:

nginx.ingress.kubernetes.io/server-snippet: |
  location "/v2/" {
    proxy_pass <s3-url>/v2/;
    proxy_set_header Host <s3-url>;
    proxy_set_header User-Agent <secret-user-agent>;
  }

For example, you could serve this on registry.<your-domain> and it would look quite professional.

Happy coding!

Sun, 6 February 202211:41:00 GMT

PowerShell Remoting on Windows using Airflow

Windows

Apache Airflow is an open-source platform that allows you to programmatically author, schedule and monitor workflows. It comes with out-of-the-box integration to lots of systems, but the adage that the devil's in the details holds true with integration in general and remote execution is no exception – in particular PowerShell Remoting which comes with Windows as part of WinRM (Windows Remote Management).

In this post, I'll share some insights from a recent project on how to use Airflow to orchestrate the execution of Windows jobs without giving up on security.

Push vs pull mode for job scheduling

Traditionally, job scheduling was done using agent software. An agent running locally as a system service would wake up and execute jobs at the scheduled time, reporting results back to a central system.

The configuration of the job schedule is either done by logging into the system itself or using a control channel. For example, the agent might connect to a central system to pull down work orders.

Meanwhile, Airflow has no such agents! Conveniently, WinRM works in push mode. It's a service running on Windows that you connect to using HTTP (or HTTPS). It's basically like connecting to a database and running a stored procedure.

From a security perspective, push mode is fundamentally different because traffic is initiated externally. While we might want to implement a thin agent to overcome this difference, such code is a liability on its own. Luckily, PowerShell Remoting comes with a framework that allows us to substantially limit the attack surface.

PowerShell as an API

The aptly named Just-Enough-Administration (JEA) framework is basically sudo on steroids. It allows us to use PowerShell as an API, constraining the remote management interface to a configurable set of commands and executing as a specific user.

We can avoid running arbitrary code entirely by encapsulating the implementation details in predefined commands. In addition, we also separate the remote user that connects to the WinRM service from the user context that executes commands.

You can use PowerShell Remoting without JEA and/or constrained endpoints. But the intersection of Airflow and Windows is typically a bigger company or organization where security concerns mean that you want both of these.

As an aside, I mentioned stored procedures earlier on. Using JEA to change context to a different user is equivalent of Definer's Rights vs Invoker's Rights. Arguably, in a system-to-system integration, using Definer's Rights is helpful in reducing the attack surface because you can define and encapsulate the required functionality.

Using JEA

The steps required to register a JEA configuration are relatively straight-forward. I won't describe them in detail here but the following bullets should give an overview:

A JEA configuration exposes a remoting endpoint. When connecting using PowerShell Remoting, the endpoint can be selected using its configuration name. The default endpoint is "Microsoft.PowerShell" – it's available to local administrators and exposes an unconstrained shell.

Never use a local administrator account with Airflow or any other system-to-system integration!
Always use the "RestrictedRemoteServer" session type. This gives you a constrained endpoint to which you can add capabilities. The endpoint will operate in "NoLanguage" mode meaning that there is no scripting functionality allowed.
A JEA configuration be limited to users that are members of a particular group. But perhaps more importantly, you can map different groups to different role capabilities. These role definitions determine the functionality exposed by the endpoint for a given user. You can have a single endpoint which defines multiple sets of functionality depending on the user which connected to the endpoint.
While a JEA configuration is registered directly with Windows, role capabilities are defined using files. A role capabilities file is responsible for exposing commands (making them visible), but you can also define custom commands directly within the file. Changes to the file take effect immediately.

For technical reasons, the role capabilities file must be placed in a "RoleCapabilities" subfolder inside an existing (possible "empty") PowerShell module – see the documentation on making role capabilities available.

In summary, registering a JEA configuration can be as simple as defining a single role capabilities file and running a command to register the configuration.

Now, enter Airflow!

Prerequisites

To get started, you'll need to add the PowerShell Remoting Protocol Provider to your Airflow installation.

Add a connection by providing the hostname of your Windows machine, username and password. If you're using HTTP (rather than HTTPS) then you should set up the connection to require Kerberos authentication such that credentials are not sent in clear text (in addition, WinRM will encrypt the protocol traffic using the Kerberos session key).

To require Kerberos authentication, provide {"auth": "kerberos"} in the connection extras. Most of the extra configuration options from the underlying Python library pypsrp are available as connection extras. For example, a JEA configuration (if using) can be specified using the "configuration_name" key.

You will need to install additional Python packages to use Kerberos. Here's a requirements file with the necessary dependencies:

apache-airflow-providers-microsoft-psrp
gssapi
krb5
pypsrp[kerberos]

Finally, a note on transport security. When WinRM is used with an HTTP listener, Kerberos authentication (acting as trusted 3rd party) supplants the use of SSL/TLS through the transparent encryption scheme employed by the protocol. You can configure WinRM to support only Kerberos (by default, "Negotiate" is also enabled) to ensure that all connections are secured in this way. Note that your IT department might still insist on using HTTPS.

Using the operator

Historically, Windows machines feel worse over time for no particular reason. It's common to restart them once in a while. We can use Airflow to do that!

from airflow.providers.microsoft.psrp.operators.psrp import PSRPOperator

default_args = {
    "psrp_conn_id": <connection id>
}

with DAG(..., default_args=default_args) as dag:
    # "task_id" defaults to the value of "cmdlet" so can omit it here.
    restart_computer = PSRPOperator(cmdlet="Restart-Computer", parameters={"Force": None})

This will restart the computer forcefully (which is not a good idea, but it illustrates the use of parameters). In the example, "Force" is a switch so we pass a value of None – but values can be numbers, strings, lists and even dictionaries.

Cut verbosity using templating

In the first example, we saw how task_id defaults to the value of cmdlet – that is sometimes useful, but it's not the only way we can cut verbosity.

PowerShell cmdlets (and functions which for our purposes are the same thing) follow the naming convention verb-noun. When we define our own commands, we can for example use the verb "Invoke", e.g. "Invoke-Job1". But invoking stuff is something we do all the time in Airflow and we don't want our task ids to have this meaningless prefix all over the place.

Here's an example of fixing that, making good use of Airflow's templating syntax:

from airflow.providers.microsoft.psrp.operators.psrp import PSRPOperator

default_args = {
    "psrp_conn_id": <connection id>,
    "cmdlet": "Invoke-{{ task.task_id }}",
}

with DAG(..., default_args=default_args) as dag:
    # "cmdlet" here will be provided automatically as "Invoke-Job1".
    job1 = PSRPOperator(task_id="Job1")

Windows can have its verb-noun naming convention and we get to have short task ids.

Output

By default, Airflow serializes operator output using XComs – a simple means of passing state between tasks.

Since XComs must be JSON-serializable, the PSRPOperator automatically converts PowerShell output values to JSON using ConvertTo-Json and then deserializes in Python before Airflow will then reserialize it when saving the XComs result to the database – there's room for optimization there! The point is that most of the time, you don't have to worry about it.

You can for example list a directory using Get-ChildItem and the resulting table will be returned as a list of dicts. Note that PowerShell has some flattening magic which generally does the right thing in terms of return values:

In PowerShell, the results of each statement are returned as output, even without a statement that contains the Return keyword.

That is, functions don't really return a single value. Instead, there is a stream of output values stemming from each command being executed.

With do_xcom_push set to false, no XComs are saved and the conversion to JSON also does not happen.

PowerShell has a number of other streams besides the output stream. These are logged to Airflow's task log by default. Unlike the default logging setup, the debug is also included unless explicitly turned off logging_level – one justification for this is given in the next section.

Debugging

In traditional automation, command echoing has been a simple way to figure out what a script is doing. PowerShell is a different beast altogether, but it is possible to expose the commands being executed using Set-PSDebug.

from pypsrp.powershell import Command, CommandParameter

PS_DEBUG = Command(
    cmd="Set-PSDebug",
    args=(CommandParameter(name="Trace", value=1), ),
    is_script=False,
)

default_args = {
    "psrp_conn_id": <connection id>,
    "psrp_session_init": PS_DEBUG,
}

This requires that Set-PSDebug is listed under "VisibleCmdlets" in the role capabilities (like ConvertTo-Json if using XComs).

A tracing line will be sent for each line passed over during execution at logging level debug, but as mentioned above, this will nonetheless get included in the task log by default. Don't enable this and have a loop that iterates hundreds of times. You will quickly fill up the task log with useless messages.

Happy remoting!

In reverse chronological order Thoughts and Writings