Thoughts and Writings, in reverse chronological order

About me

Hello!

Thanks for visiting my website. I run a technology-focused consulting company where I do problem-solving and leadership for both startups and established companies (see my professional profile for more information).

Airplane
Photo by Jeremy Bishop on Unsplash

Tue, 18 October 2022 20:54:00 GMT

Mitigating Username-Attacks on Digital Identity Logins

In Denmark, we've had a digital national identity service in operation since 2010, used for example to get access to secure e-mail and government online services.

It was revamped last year, in October 2021, introducing a new login mechanism to improve security.

The new login mechanism (MitID) is a three-step process:

  1. Input username and continue.
  2. Open app.
  3. Review and swipe.

Unlike the original system, the 2021 revamp does not prompt the user to open the app.

In principle, this is a considerable improvement since the user actively participates in the login flow. In the previous system, the user instead needed a password in addition to the username, and passwords tend to be weak, for one reason or another.

Notice, however, that a login-session can be opened simply by knowing the username.

A weakness in the new system

The service is now facing criticism after it was discovered that a malicious actor can block a user from using the service if the username follows a template such as <firstname>.<lastname> (or can be guessed by some other means).

Basically, after N failed attempts to log in using a valid username, the account is suspended until someone intervenes manually.

Needless to say, this is not acceptable for the user, nor is it scalable for the operator. But the first point is really the interesting one: a digital identity service should always allow the real user to log in!

The trouble is of course not new. If multiple sessions are initiated at roughly the same time, which one is the right one? The simplest control available is to deny all attempts, but this includes the real user. Waiting for some amount of time to elapse is not a cure because the malicious actor can simply repeat the process.

The fundamental problem here is that a more or less arbitrary number of malicious login-sessions can be opened at any time for a given username. The internet is not a friendly place and distributed attacks are feasible. Traditional techniques such as blocking IPs are not adequate today.

Mitigating an attack

The real user must somehow pair the real login attempt with the app, disambiguating between the login attempts by the malicious actor.

There are lots of ways to devise such a mechanism. We want one that adapts to the situation. If there's just one login-session, we don't need any mitigation, but if there are multiple sessions, we want an effective means of discarding the bad ones, leaving just the single, real one.

Inspired by Apple ID, we'll use a verification model based on two-digit numbers, but expand to a matrix of 16 choices, ordered by value for convenience.

  • 09
  • 13
  • 17
  • 26
  • 29
  • 32
  • 41
  • 50
  • 58
  • 62
  • 64
  • 77
  • 81
  • 84
  • 95
  • 96

The idea is that the user will take a certain action using the app based on the numbers shown on the screen.

For example, pick out a sequence of numbers:

  • 84
  • 09
  • 26
  • 58

If we require a specific ordered sequence of 4 elements, we get a total of 43,680 combinations. Adding just two more gives us 5,765,760 combinations.
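These counts are ordered selections without repetition from the 16 numbers. As a quick check, here is a small sketch in Python (math.perm needs Python 3.8 or later; the variable names are just for illustration):

```python
from math import perm

CHOICES = 16  # size of the number matrix shown in the app

# Ordered sequences without repetition: 16 * 15 * 14 * 13, and so on.
four = perm(CHOICES, 4)
six = perm(CHOICES, 6)

print(f"{four:,}")  # 43,680
print(f"{six:,}")   # 5,765,760
```

A single blind guess at a 4-number sequence therefore succeeds with probability 1 in 43,680.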

No matter the number of bad login-sessions attempted at any given time, we can always provide such a sequence to disambiguate, making it exceedingly unlikely that a malicious actor gets through.

Adjusted login-flow

The new logic kicks in only when the user opens the app. At this time, the system knows the number of concurrent login-sessions. We'll limit the validity of a login-session to a short amount of time, for example 30 seconds.

  1. If the number of sessions is 1, we'll revert to the normal behavior.
  2. Otherwise, determine an appropriate number of items to choose from.
  3. Each login-session will show a sequence with this many items.
  4. The user will choose the items from the right sequence, selecting each number in the given order.

This system can be trivially implemented to run at scale using a key/value store.
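To make the flow concrete, here is a minimal sketch in Python. It uses a plain dict as a stand-in for the key/value store, and every name in it (open_session, app_opened, approve, the margin used to pick the sequence length) is an assumption made for illustration rather than part of any real MitID interface.

```python
import secrets
import time
from math import perm

NUMBERS = ["09", "13", "17", "26", "29", "32", "41", "50",
           "58", "62", "64", "77", "81", "84", "95", "96"]
SESSION_TTL = 30  # seconds a login-session stays valid

# Stand-in for a key/value store: username -> {session_id: session}
store = {}


def open_session(username, now=None):
    """A browser submitted this username: record a new login-session."""
    now = time.time() if now is None else now
    session_id = secrets.token_hex(8)
    store.setdefault(username, {})[session_id] = {
        "expires": now + SESSION_TTL,
        "sequence": None,
    }
    return session_id


def sequence_length(session_count, margin=10_000):
    """Hypothetical policy: grow the sequence until the number of possible
    ordered sequences is at least session_count * margin."""
    k = 1
    while k < len(NUMBERS) and perm(len(NUMBERS), k) < session_count * margin:
        k += 1
    return k


def app_opened(username, now=None):
    """The user opened the app: expire old sessions, then hand out a
    distinct sequence to each session if there is more than one."""
    now = time.time() if now is None else now
    live = {sid: s for sid, s in store.get(username, {}).items()
            if s["expires"] > now}
    store[username] = live
    if len(live) > 1:
        k = sequence_length(len(live))
        rng = secrets.SystemRandom()
        used = set()
        for session in live.values():
            seq = tuple(rng.sample(NUMBERS, k))
            while seq in used:
                seq = tuple(rng.sample(NUMBERS, k))
            used.add(seq)
            session["sequence"] = seq
    return live


def approve(username, chosen):
    """Return the session whose sequence the user picked in the app and
    discard all other sessions for this username."""
    live = store.pop(username, {})
    for sid, session in live.items():
        if session["sequence"] is not None and session["sequence"] == tuple(chosen):
            return sid
    return None


# One real session and three malicious ones for the same username.
mine = open_session("jane.doe")
for _ in range(3):
    open_session("jane.doe")
challenges = app_opened("jane.doe")
picked = challenges[mine]["sequence"]  # what the real user reads off the screen
print(approve("jane.doe", picked) == mine)  # True
```

With a single live session no sequence is assigned and the normal swipe applies. A real deployment would replace the dict with atomic operations on the store and let the store's own key expiry enforce the 30 seconds.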

Wed, 13 April 2022 17:16:00 GMT

Automatic HTTPS on Kubernetes

The ingress controller supported by the Kubernetes project itself is nginx. And while there are recipes for setting up automated issuing of TLS certificates using free CAs such as Let's Encrypt, there are quite a few steps involved and you will need to deploy additional services to your cluster to make it work.

Meanwhile, the ingress controller for Caddy does it fully automatically, out of the box.

Enable it during install using the onDemandTLS option like so:

$ helm install \
    --namespace=caddy-system \
    --repo https://caddyserver.github.io/ingress/ \
    --atomic \
    --set image.tag=v0.1.0 \
    --set ingressController.config.onDemandTLS=true \
    --set ingressController.config.email=<your-email> \
    --set replicaCount=1 \
    --version 1.0.0 \
    main \
    caddy-ingress-controller

The email option is to allow the CA to send expiry notices if your certificate is coming up for renewal. I suppose that doesn't hurt.

HTTPS for a local setup

Sometimes it's nice to point a domain to localhost and have HTTPS working for it nonetheless – for example, when testing out authentication flows.

I use a combination of tools to achieve this:

In the real world, my domain is pointing to the Kubernetes cluster. But since I don't have nginx running as my ingress controller, I need an actual service to reply to the ACME requests that will be sent to <my-domain>/.well-known/acme-challenge/<key>.

Python to the rescue!

I added a deployment to the Kubernetes cluster with an image set to python:slim-bullseye and simply mounted the script below as /scripts/main.py using a configmap.

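from os import environ
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = int(environ.get("PORT", 8080))
# Thumbprint of the ACME account; required, so fail early if unset.
ACCOUNT_THUMBPRINT = environ["ACCOUNT_THUMBPRINT"]

class handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        # The challenge token is the last path segment of
        # /.well-known/acme-challenge/<key>.
        challenge = self.path.rsplit("/", 1)[-1]
        # Stateless mode: reply with "<token>.<account thumbprint>".
        message = f"{challenge}.{ACCOUNT_THUMBPRINT}"
        self.wfile.write(bytes(message, "ascii"))

# Listen on all interfaces until the pod is stopped.
with HTTPServer(("", PORT), handler) as server:
    server.serve_forever()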

The deployment is set to run python /scripts/main.py. The account thumbprint is a secret key that you get when you register an account with the ACME shell script.

Kind of complicated – but at least now I can issue a TLS certificate for my domain any time using:

$ acme.sh --issue -d example.com --stateless

The setup would be a little smoother if I had published a ready-to-go container image with the script included.

Sun, 27 February 2022 10:48:00 GMT

Container Registry on a Budget using AWS S3

Inspired by Mike Cartmell's Budget Kubernetes Hosting for Personal Use blog series, I wanted to set up my own Kubernetes cluster on DigitalOcean on a budget (although I have since moved to Scaleway).

There was one component Mike wasn't able to skimp on – the container registry.

DigitalOcean charges $5/month for the basic plan of their container registry product, which includes 5 GiB of storage. Not impressive!

Their AWS S3-compatible Spaces product provides 250 GiB of object storage for the same money – and for most regions, traffic from object storage to pods is free.

A static container registry

This got me thinking, what actually is a container registry in the context of Kubernetes? Isn't it just some static file repository with image manifest and layer data – that is, why aren't people using a simple object store to host their containers rather than a complex service?

I'm not the first person to get that idea, but things have changed in the past few years on the container scene and I ran into some obsolete technology trying out existing solutions.

I run containerd on my local system and use nerdctl as a (more or less) drop-in replacement for Docker. Some of the tools used in these other solutions such as skopeo just aren't compatible with this setup. It was actually a turn for the better, because nerdctl makes the process a lot easier.

Exporting an image

Unlike Docker's save command, nerdctl save exports images in the Docker image manifest V2, schema 2 format – essentially an OCI image index.

Exporting an image basically looks like this (assuming you've already pulled down the image):

$ nerdctl save gcr.io/google-samples/node-hello:1.0 > image.tar

Uploading an image to object storage

The file structure in a container registry isn't directly compatible with the image index format – but it's quite close.
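To illustrate what "quite close" means, here is a sketch of the object keys a registry client requests, following the paths of the registry HTTP API (/v2/<name>/manifests/<reference> and /v2/<name>/blobs/<digest>). This is not the upload script mentioned in this post; the function names are made up for the example.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest in the form registries use to address blobs."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def manifest_key(name: str, reference: str) -> str:
    """Object key for a manifest, addressed by tag or by digest."""
    return f"v2/{name}/manifests/{reference}"


def blob_key(name: str, data: bytes) -> str:
    """Object key for a layer or config blob, addressed by its digest."""
    return f"v2/{name}/blobs/{digest(data)}"


print(manifest_key("google-samples/node-hello", "1.0"))
# v2/google-samples/node-hello/manifests/1.0
```

An exported image already contains the manifest and the blobs named by digest, so uploading is mostly a matter of copying each file to the matching key and setting a suitable content type on the manifest.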

I prepared a bash script that automates the uploading process. The code is available on GitHub.

The script sets an ACL (access control list) of public-read. If you know the name of the image and tag, then you can pull down the image. That's great if you're building open-source software, but sometimes you're not.

Keeping images safe from prying eyes

AWS S3 (and perhaps surprisingly, DigitalOcean Spaces) provides a quite flexible means of restricting access to object storage called bucket policies. DigitalOcean doesn't really advertise this functionality much and hardly documents its usage except in their forums.

But it's quite straightforward to add a policy that restricts access to our container registry (which has been uploaded to the /v2/ path in the object storage as per the container registry protocol).

On AWS S3, you can attach a VPC endpoint for Amazon S3 and use a condition on aws:SourceVpce to limit access to the container registry on a network level.

{
    "Statement": [
        {
            "Sid": "Allow access to container registry.",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<bucket-name>/v2/*",
            "Condition": {
                "StringNotEquals": {
                    "aws:SourceVpce": [
                        "vpce-1111111",
                        "vpce-2222222"
                    ]
                }
            }
        }
    ]
}

Not so on DigitalOcean!

We can limit on aws:SourceIp, but this is sometimes awkward or impossible. In that case, we may instead use a clever workaround: require some unguessable user agent header value (such as "secret123") and deny any request that does not have it.
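As a sketch of what such a policy could look like, the following Python snippet builds it in the same shape as the VPC endpoint policy above and prints it as JSON. It assumes the object store honors the aws:UserAgent condition key; the function name and the Sid are made up for the example.

```python
import json


def user_agent_policy(bucket: str, secret_user_agent: str) -> dict:
    """Deny reads under /v2/ unless the request carries the secret user agent."""
    return {
        "Statement": [
            {
                "Sid": "RequireSecretUserAgent",  # hypothetical identifier
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/v2/*",
                "Condition": {
                    "StringNotEquals": {"aws:UserAgent": secret_user_agent}
                },
            }
        ]
    }


# Redirect the output to policy.json to use it with the command below.
print(json.dumps(user_agent_policy("<bucket-name>", "secret123"), indent=4))
```

The user agent value acts as a shared secret here, so it should be long and random rather than something like "secret123".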

Either way, set the policy using:

$ aws s3api put-bucket-policy --bucket <name> --policy "file://policy.json"

(For DigitalOcean Spaces, you need to configure the correct endpoint here as usual.)

Serving using Kubernetes

If you're using the user agent condition you will need a reverse proxy since there is no built-in mechanism when pulling a container image to provide such a custom header.

Reverse proxying functionality is included in the standard NGINX ingress controller; one way to enable it is through a server snippet annotation:

nginx.ingress.kubernetes.io/server-snippet: |
  location "/v2/" {
    proxy_pass <s3-url>/v2/;
    proxy_set_header Host <s3-url>;
    proxy_set_header User-Agent <secret-user-agent>;
  }

For example, you could serve this on registry.<your-domain> and it would look quite professional.

Happy coding!

Sun, 6 February 2022 11:41:00 GMT

PowerShell Remoting on Windows using Airflow

Apache Airflow is an open-source platform that allows you to programmatically author, schedule and monitor workflows. It comes with out-of-the-box integration to lots of systems, but the adage that the devil's in the details holds true for integration in general, and remote execution is no exception – in particular PowerShell Remoting, which comes with Windows as part of WinRM (Windows Remote Management).

In this post, I'll share some insights from a recent project on how to use Airflow to orchestrate the execution of Windows jobs without giving up on security.

Push vs pull mode for job scheduling

Traditionally, job scheduling was done using agent software. An agent running locally as a system service would wake up and execute jobs at the scheduled time, reporting results back to a central system.

The configuration of the job schedule is either done by logging into the system itself or using a control channel. For example, the agent might connect to a central system to pull down work orders.

Meanwhile, Airflow has no such agents! Conveniently, WinRM works in push mode. It's a service running on Windows that you connect to using HTTP (or HTTPS). It's basically like connecting to a database and running a stored procedure.

From a security perspective, push mode is fundamentally different because traffic is initiated externally. While we might want to implement a thin agent to overcome this difference, such code is a liability on its own. Luckily, PowerShell Remoting comes with a framework that allows us to substantially limit the attack surface.

PowerShell as an API

The aptly named Just-Enough-Administration (JEA) framework is basically sudo on steroids. It allows us to use PowerShell as an API, constraining the remote management interface to a configurable set of commands and executing as a specific user.

We can avoid running arbitrary code entirely by encapsulating the implementation details in predefined commands. In addition, we also separate the remote user that connects to the WinRM service from the user context that executes commands.

You can use PowerShell Remoting without JEA and/or constrained endpoints. But the intersection of Airflow and Windows is typically a bigger company or organization where security concerns mean that you want both of these.

As an aside, I mentioned stored procedures earlier on. Using JEA to change context to a different user is the equivalent of choosing Definer's Rights over Invoker's Rights. Arguably, in a system-to-system integration, using Definer's Rights is helpful in reducing the attack surface because you can define and encapsulate the required functionality.

Using JEA

The steps required to register a JEA configuration are relatively straightforward. I won't describe them in detail here but the following bullets should give an overview:

  • A JEA configuration exposes a remoting endpoint. When connecting using PowerShell Remoting, the endpoint can be selected using its configuration name. The default endpoint is "Microsoft.PowerShell" – it's available to local administrators and exposes an unconstrained shell.

    Never use a local administrator account with Airflow or any other system-to-system integration!

  • Always use the "RestrictedRemoteServer" session type. This gives you a constrained endpoint to which you can add capabilities. The endpoint will operate in "NoLanguage" mode meaning that there is no scripting functionality allowed.
  • A JEA configuration can be limited to users that are members of a particular group. But perhaps more importantly, you can map different groups to different role capabilities. These role definitions determine the functionality exposed by the endpoint for a given user. You can have a single endpoint which defines multiple sets of functionality depending on the user who connected to the endpoint.
  • While a JEA configuration is registered directly with Windows, role capabilities are defined using files. A role capabilities file is responsible for exposing commands (making them visible), but you can also define custom commands directly within the file. Changes to the file take effect immediately.

    For technical reasons, the role capabilities file must be placed in a "RoleCapabilities" subfolder inside an existing (possibly "empty") PowerShell module – see the documentation on making role capabilities available.

In summary, registering a JEA configuration can be as simple as defining a single role capabilities file and running a command to register the configuration.

Now, enter Airflow!

Prerequisites

To get started, you'll need to add the PowerShell Remoting Protocol Provider to your Airflow installation.

Add a connection by providing the hostname of your Windows machine, username and password. If you're using HTTP (rather than HTTPS) then you should set up the connection to require Kerberos authentication such that credentials are not sent in clear text (in addition, WinRM will encrypt the protocol traffic using the Kerberos session key).

To require Kerberos authentication, provide {"auth": "kerberos"} in the connection extras. Most of the extra configuration options from the underlying Python library pypsrp are available as connection extras. For example, a JEA configuration (if using) can be specified using the "configuration_name" key.
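As a small sketch, the extras for a Kerberos-authenticated connection to a JEA endpoint could be assembled like this. Only the "auth" and "configuration_name" keys come from the description above; the endpoint name "AirflowJobs" is invented for the example.

```python
import json

extras = {
    "auth": "kerberos",                   # require Kerberos authentication
    "configuration_name": "AirflowJobs",  # hypothetical JEA endpoint name
}

# Paste the resulting string into the connection's extras.
print(json.dumps(extras))
# {"auth": "kerberos", "configuration_name": "AirflowJobs"}
```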

You will need to install additional Python packages to use Kerberos. Here's a requirements file with the necessary dependencies:

apache-airflow-providers-microsoft-psrp
gssapi
krb5
pypsrp[kerberos]

Finally, a note on transport security. When WinRM is used with an HTTP listener, Kerberos authentication (acting as trusted 3rd party) supplants the use of SSL/TLS through the transparent encryption scheme employed by the protocol. You can configure WinRM to support only Kerberos (by default, "Negotiate" is also enabled) to ensure that all connections are secured in this way. Note that your IT department might still insist on using HTTPS.

Using the operator

Historically, Windows machines feel worse over time for no particular reason. It's common to restart them once in a while. We can use Airflow to do that!

from airflow import DAG
from airflow.providers.microsoft.psrp.operators.psrp import PSRPOperator

default_args = {
    "psrp_conn_id": "<connection id>",
}

with DAG(..., default_args=default_args) as dag:
    # "task_id" defaults to the value of "cmdlet" so we can omit it here.
    restart_computer = PSRPOperator(cmdlet="Restart-Computer", parameters={"Force": None})

This will restart the computer forcefully (which is not a good idea, but it illustrates the use of parameters). In the example, "Force" is a switch so we pass a value of None – but values can be numbers, strings, lists and even dictionaries.

Cut verbosity using templating

In the first example, we saw how task_id defaults to the value of cmdlet – that is sometimes useful, but it's not the only way we can cut verbosity.

PowerShell cmdlets (and functions which for our purposes are the same thing) follow the naming convention verb-noun. When we define our own commands, we can for example use the verb "Invoke", e.g. "Invoke-Job1". But invoking stuff is something we do all the time in Airflow and we don't want our task ids to have this meaningless prefix all over the place.

Here's an example of fixing that, making good use of Airflow's templating syntax:

from airflow import DAG
from airflow.providers.microsoft.psrp.operators.psrp import PSRPOperator

default_args = {
    "psrp_conn_id": "<connection id>",
    "cmdlet": "Invoke-{{ task.task_id }}",
}

with DAG(..., default_args=default_args) as dag:
    # "cmdlet" here will be provided automatically as "Invoke-Job1".
    job1 = PSRPOperator(task_id="Job1")

Windows can have its verb-noun naming convention and we get to have short task ids.

Output

By default, Airflow serializes operator output using XComs – a simple means of passing state between tasks.

Since XComs must be JSON-serializable, the PSRPOperator automatically converts PowerShell output values to JSON using ConvertTo-Json and then deserializes them in Python. Airflow then reserializes the result when saving the XComs to the database – there's room for optimization there! The point is that most of the time, you don't have to worry about it.

You can for example list a directory using Get-ChildItem and the resulting table will be returned as a list of dicts. Note that PowerShell has some flattening magic which generally does the right thing in terms of return values:

"In PowerShell, the results of each statement are returned as output, even without a statement that contains the Return keyword."

That is, functions don't really return a single value. Instead, there is a stream of output values stemming from each command being executed.

With do_xcom_push set to false, no XComs are saved and the conversion to JSON also does not happen.

PowerShell has a number of other streams besides the output stream. These are logged to Airflow's task log by default. Unlike the default logging setup, the debug stream is also included unless explicitly turned off using logging_level – one justification for this is given in the next section.

Debugging

In traditional automation, command echoing has been a simple way to figure out what a script is doing. PowerShell is a different beast altogether, but it is possible to expose the commands being executed using Set-PSDebug.

from pypsrp.powershell import Command, CommandParameter

PS_DEBUG = Command(
    cmd="Set-PSDebug",
    args=(CommandParameter(name="Trace", value=1), ),
    is_script=False,
)

default_args = {
    "psrp_conn_id": "<connection id>",
    "psrp_session_init": PS_DEBUG,
}

This requires that Set-PSDebug is listed under "VisibleCmdlets" in the role capabilities (like ConvertTo-Json if using XComs).

A tracing line will be sent for each line passed over during execution at logging level debug, but as mentioned above, this will nonetheless get included in the task log by default. Don't enable this and have a loop that iterates hundreds of times. You will quickly fill up the task log with useless messages.

Happy remoting!
