Apache Airflow is an open-source platform that allows you to programmatically author, schedule and monitor workflows. It comes with out-of-the-box integrations for lots of systems, but the adage that the devil is in the details holds true for integration in general, and remote execution is no exception. A case in point is PowerShell Remoting, which ships with Windows as part of WinRM (Windows Remote Management). In this post, I'll share some insights from a recent project on how to use Airflow to orchestrate the execution of Windows jobs without giving up on security.
Push vs pull mode for job scheduling
Traditionally, job scheduling was done using agent software. An agent running
locally as a system service would wake up and execute jobs at the scheduled time,
reporting results back to a central system.
The job schedule is configured either by logging into the system itself or through a control channel; for example, the agent might connect to a central system to pull down work orders.
Meanwhile, Airflow has no such agents! Conveniently,
WinRM works in push mode. It's a service running on
Windows that you connect to using HTTP (or HTTPS). It's basically like connecting
to a database and running a stored procedure.
From a security perspective, push mode is
fundamentally different because traffic is initiated
externally. While we might want to implement a thin
agent to overcome this difference, such code is a
liability on its own. Luckily, PowerShell Remoting
comes with a framework that allows us to
substantially limit the attack surface.
PowerShell as an API
The aptly named Just Enough Administration (JEA) framework is basically sudo on steroids. It allows us to use PowerShell as an API, constraining the remote management interface to a configurable set of commands and executing them as a specific user.
We can avoid running arbitrary code entirely
by encapsulating the implementation details in
predefined commands. In addition, we also separate
the remote user that connects to the WinRM service
from the user context that executes commands.
You can use PowerShell Remoting without JEA and/or constrained endpoints. But the intersection of Airflow and Windows is typically found in larger companies and organizations, where security concerns mean that you'll want both.
As an aside, I mentioned stored procedures earlier on. Using JEA to change context to a different user is the equivalent of Definer's Rights as opposed to Invoker's Rights. Arguably, in a system-to-system integration, using Definer's Rights helps reduce the attack surface because you can define and encapsulate the required functionality.
Using JEA
The steps required to register a JEA configuration are relatively straightforward. I won't describe them in detail here, but the following bullets should give an overview:

- Define a role capabilities file (a .psrc file) listing the commands the role may run ("VisibleCmdlets" and friends).
- Optionally, define a session configuration file (a .pssc file) that maps connecting users or groups to roles and specifies the account that commands execute as.
- Register the endpoint using the Register-PSSessionConfiguration command.

In summary, registering a JEA configuration can be as simple as defining a single role capabilities file and running a command to register the configuration.
Now, enter Airflow!
Prerequisites
To get started, you'll need to add the PowerShell
Remoting Protocol Provider to your Airflow
installation.
Add a connection by providing the hostname of your
Windows machine, username and password. If you're
using HTTP (rather than HTTPS) then you should set
up the connection to require Kerberos
authentication such that credentials are not sent in
clear text (in addition, WinRM will encrypt the protocol traffic
using the Kerberos session key).
To require Kerberos authentication,
provide {"auth": "kerberos"} in the
connection extras. Most of the extra configuration
options from the underlying Python library
pypsrp
are available as connection extras. For example, a
JEA configuration (if using) can be specified using
the "configuration_name" key.
You will need to install additional Python packages
to use Kerberos. Here's a requirements
file with the necessary dependencies:
apache-airflow-providers-microsoft-psrp
gssapi
krb5
pypsrp[kerberos]
Finally, a note on transport security. When WinRM is used with an HTTP listener, Kerberos authentication (acting as a trusted third party) supplants the use of SSL/TLS through the transparent encryption scheme employed by the protocol. You can configure WinRM to support only Kerberos (by default, "Negotiate" is also enabled) to ensure that all connections are secured in this way. Note that your IT department might still insist on using HTTPS.
Using the operator
Historically, Windows machines feel worse over time
for no particular reason. It's common to restart
them once in a while. We can use Airflow to do that!
from airflow import DAG
from airflow.providers.microsoft.psrp.operators.psrp import PSRPOperator

default_args = {
    "psrp_conn_id": <connection id>,
}

with DAG(..., default_args=default_args) as dag:
    # "task_id" defaults to the value of "cmdlet", so we can omit it here.
    restart_computer = PSRPOperator(cmdlet="Restart-Computer", parameters={"Force": None})
This will restart the computer forcefully (which is
not a good idea, but it illustrates the use of
parameters). In the example, "Force" is
a switch so we pass a value of None
– but values can be numbers, strings, lists and even
dictionaries.
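To illustrate, here's a hypothetical sketch – the "Invoke-Backup" command and all of its parameter names are made up:

# Continuing inside the DAG context from the example above;
# "Invoke-Backup" and its parameters are hypothetical.
backup = PSRPOperator(
    task_id="Backup",
    cmdlet="Invoke-Backup",
    parameters={
        "Path": "D:\\Data",              # string
        "RetentionDays": 30,             # number
        "Include": ["*.bak", "*.trn"],   # list
        "Tags": {"Owner": "airflow"},    # dictionary
        "Verify": None,                  # switch
    },
)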
Cut verbosity using templating
In the first example, we saw how task_id
defaults to the value of cmdlet – that is sometimes useful, but
it's not the only way we can cut verbosity.
PowerShell cmdlets (and functions, which for our purposes are the same thing) follow the verb-noun naming convention. When we define our own commands, we might use the verb "Invoke", e.g. "Invoke-Job1". But invoking stuff is something we do all the time in Airflow, and we don't want our task ids to have this meaningless prefix all over the place.
Here's an example
of fixing that, making good use of Airflow's templating syntax:
from airflow import DAG
from airflow.providers.microsoft.psrp.operators.psrp import PSRPOperator

default_args = {
    "psrp_conn_id": <connection id>,
    "cmdlet": "Invoke-{{ task.task_id }}",
}

with DAG(..., default_args=default_args) as dag:
    # "cmdlet" will be provided automatically as "Invoke-Job1".
    job1 = PSRPOperator(task_id="Job1")
Windows can have its verb-noun naming
convention and we get to have short task ids.
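And since the default applies across the DAG, additional jobs stay terse. A usage sketch (the job names are hypothetical):

    # Continuing inside the DAG context from the example above: each task id
    # expands to "Invoke-<task_id>" via the templated default.
    job2 = PSRPOperator(task_id="Job2")
    job3 = PSRPOperator(task_id="Job3")
    job2 >> job3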
Output
By default, Airflow serializes operator output using XComs – a simple means of passing state between tasks. Since XComs must be JSON-serializable, the PSRPOperator automatically converts PowerShell output values to JSON using ConvertTo-Json and then deserializes them in Python; Airflow then reserializes the result when saving it to the database. There's room for optimization there, but the point is that most of the time, you don't have to worry about it.
You can for example list a directory
using Get-ChildItem and the resulting table
will be returned as a list of dicts. Note that
PowerShell has some flattening magic which generally
does the right thing in terms of
return values:
In PowerShell, the results of each statement are returned as output, even without a statement that contains the Return keyword.
That is, functions don't really return a single
value. Instead, there is a stream of output values
stemming from each command being executed.
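As a concrete sketch, here's a directory listing whose table output lands in XCom as a list of dicts (the path is illustrative):

# Continuing inside the DAG context from earlier; each file emitted by
# Get-ChildItem becomes a dict in the XCom value.
list_temp = PSRPOperator(
    task_id="ListTemp",
    cmdlet="Get-ChildItem",
    parameters={"Path": "C:\\Temp"},
)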
With do_xcom_push set to False, no XComs are saved and the conversion to JSON is skipped.
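That can be worth doing for jobs with large or uninteresting output – a minimal sketch, with a hypothetical task and command name:

# No XCom is saved and no ConvertTo-Json round-trip takes place.
cleanup = PSRPOperator(task_id="Cleanup", cmdlet="Invoke-Cleanup", do_xcom_push=False)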
PowerShell has a number of other streams besides the output stream. These are logged to Airflow's task log by default. Unlike the default logging setup, debug messages are also included, unless explicitly turned off using the logging_level option – one justification for this default is given in the next section.
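To turn debug messages off, raise the threshold – a sketch, assuming logging_level accepts the standard Python logging levels:

import logging

# Messages below INFO (e.g. the debug stream) are no longer logged.
quiet_job = PSRPOperator(task_id="Job1", cmdlet="Invoke-Job1", logging_level=logging.INFO)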
Debugging
In traditional automation,
command
echoing has been a simple way to figure out what
a script is doing. PowerShell is a different beast
altogether, but it is possible to expose the commands being executed using
Set-PSDebug.
from pypsrp.powershell import Command, CommandParameter

PS_DEBUG = Command(
    cmd="Set-PSDebug",
    args=(CommandParameter(name="Trace", value=1),),
    is_script=False,
)

default_args = {
    "psrp_conn_id": <connection id>,
    "psrp_session_init": PS_DEBUG,
}
This requires that Set-PSDebug is listed under "VisibleCmdlets" in the role capabilities (just like ConvertTo-Json when using XComs). A tracing line is emitted at the debug logging level for each line passed over during execution; as mentioned above, these messages are nonetheless included in the task log by default. Don't enable this if a job contains a loop that iterates hundreds of times – you will quickly fill up the task log with useless messages.
Happy remoting!