Kodex (Community Edition – CE) is an open-source toolkit for privacy and security engineering. It helps you to automate data security and data protection measures in your data engineering workflows. It offers the following functionality:
- Read data items from a variety of sources such as files, databases or message queues.
- Protect these data items using various privacy- & security-enhancing transformations, such as de-identification, masking, pseudonymization, anonymization or encryption.
- Send the protected items to a variety of destinations.
With Kodex, you can describe your data protection and data security workflows using a simple, declarative configuration language: Just like DevOps tools let you describe infrastructure as code, Kodex is a PrivacyOps & SecurityOps tool that lets you describe privacy and security measures as code.
Kodex takes care of the boring and difficult aspects of privacy, such as
- Key management: Kodex manages encryption and pseudonymization keys for you (if you want that).
- Parameter management: Kodex keeps track of how every single data item was processed so you can prove the compliance of your data workflows and create an audit trail.
- Data transformation: Kodex implements modern cryptographic and statistical techniques to protect your data.
Getting started
To download and install Kodex from source, simply run
git clone https://github.com/kiprotect/kodex
cd kodex
make
make install
Documentation
CLI
The command line utility is the easiest way to get started analyzing and transforming data with Kodex.
Blueprints
Blueprints are the configuration files that specify how Kodex should load, analyze, transform and write data.
Actions
Actions perform specific operations on input items, e.g. to analyze or transform them. Everything that Kodex does with data is described by an action.
Kodex Community Edition (CE)
To install the community edition of Kodex, simply run
git clone https://github.com/kiprotect/kodex
cd kodex
make
make install
That’s it! You can now process blueprints with the kodex CLI command.
Kodex Enterprise Edition (EE)
To install the enterprise edition of Kodex (Kodex-EE), first download kodex-ee from our web portal. The binary contains everything you need to get started. Notably, it provides an install command that (mostly) automates the installation of Kodex-EE for different environments. For example, to create a Docker Compose configuration with all Kodex components and required external services, simply run
kodex-ee install --method docker-compose
cd docker-compose
docker-compose up
That should give you a fully functional test system with the Kodex frontend running on port 4242. The following sections explain in detail how to get Kodex-EE running on your infrastructure in a more permanent way.
Deployment Options
Kodex-EE is deployed as a set of stateless Linux services that interact with external databases and systems, as described above. A full deployment consists of at least one daemon and one API instance. Both of these services are stateless and can be configured using a simple settings file. The API is designed to be used behind a reverse proxy that handles TLS and load balancing (example configurations for servers like nginx or caddy are available).
Requirements
Kodex-EE requires several external services to function:
- A PostgreSQL database (preferably version 9.6 or newer), which Kodex-EE uses to store meta-data about data streams, actions and parameters. Specific Kodex-EE plugins (e.g. Klaro, Konsens) use their own PostgreSQL schemas for data storage. For stability and performance reasons, it is advisable to provide separate PostgreSQL instances for these.
- A RabbitMQ server (preferably version 3.8 or newer), or a Kafka server (preferably version 2.7 or newer), which Kodex-EE uses to store intermediate data items during processing.
- An authentication service based on SSO, SAML, OpenID-Connect or LDAP, which Kodex-EE can use to authenticate users. The service must provide a mapping of users to organizations and roles within those organizations.
- A Redis database (preferably version 5.0 or newer), which Kodex-EE uses to store, for example, API and data processing metrics as well as data necessary for rate limiting.
- Optionally, either a persistent directory/volume that is accessible from API and daemon nodes, or a “blob” storage service with an S3-compatible API. Kodex-EE uses these to store file data that is generated for certain background tasks.
The given minimum versions of services are what Kodex-EE is tested and deployed against. It might be possible to run it with older versions as well, though we cannot make any guarantees. The same is true when using newer versions of these services.
PostgreSQL Database(s)
Kodex-EE uses a primary database schema to store meta-data about things like data transformations, sources and destinations. For regular usage, only a small number of table entries are generated, and the primary database schema rarely grows beyond a few GB. Specific Kodex-EE plugins (e.g. Klaro or Konsens) use secondary database schemas to store data results, which grow in proportion to the processed data. Kodex-EE can implement a data retention schedule and will delete old data. Nevertheless, the required storage will depend on the amount of data processed. Please note that performance may also degrade if very large amounts of data are being processed. All Kodex-EE data schemas are designed to handle billions of rows of data without significant problems, though, provided adequate hardware has been provisioned for the databases.
Redis Database(s)
Kodex-EE uses Redis for metrics collection and for keeping internal state during processing (e.g. for anonymization). Metrics are generated using hierarchical time spans (hours, days, weeks, months, years) and individual data items expire automatically. Internal state storage requirements will vary depending on the type of data processing and the number of items processed.
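As a rough illustration of hierarchical time-span metrics, the following Python sketch derives one counter key per time span for a given timestamp. The key layout and the function name are assumptions made for this example; they do not describe the Redis schema that Kodex-EE actually uses.

```python
from datetime import datetime, timezone

def metric_keys(name: str, ts: datetime) -> dict:
    """Return one bucket key per time span (hypothetical key layout)."""
    ts = ts.astimezone(timezone.utc)
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "hour": f"{name}:hour:{ts:%Y-%m-%dT%H}",
        "day": f"{name}:day:{ts:%Y-%m-%d}",
        "week": f"{name}:week:{iso_year}-W{iso_week:02d}",
        "month": f"{name}:month:{ts:%Y-%m}",
        "year": f"{name}:year:{ts:%Y}",
    }

# One processed item would increment the counter behind each of these keys.
keys = metric_keys("items-processed", datetime(2020, 6, 4, 13, 37, tzinfo=timezone.utc))
```

In such a layout, fine-grained buckets (hours) could be given short expiry times and coarse ones (years) long expiry times, which keeps the total number of stored keys small.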
Message Queue(s) (RabbitMQ or Kafka)
Kodex-EE stores processed data in a message queue for internal processing and buffering. The amount of data stored in those queues depends on many factors, notably the number of incoming data items, the number of daemon nodes and the read/write capacity of data sources and destinations. Kodex-EE will automatically apply backpressure when writes to the internal queue become impossible.
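The general idea of backpressure can be sketched with a bounded in-memory queue in Python: once the buffer is full, the producer is told to slow down instead of piling up more data. This is only a conceptual sketch; the function name and the tiny capacity are assumptions, and Kodex-EE itself buffers in RabbitMQ or Kafka rather than in process memory.

```python
import queue

def try_enqueue(buffer: queue.Queue, item: dict) -> bool:
    """Try to buffer an item; return False to signal backpressure."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        # The caller should pause reading from its source and retry later.
        return False

buffer = queue.Queue(maxsize=2)  # tiny capacity, for illustration only
accepted = [try_enqueue(buffer, {"id": i}) for i in range(3)]
```

Here the third item is rejected because the buffer already holds two items, so the reader would stop pulling from the data source until a consumer has drained the queue.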
Daemon & API Requirements
Kodex-EE can run on systems with minimal resources, such as an embedded system or SoC.
Operating System
Kodex-EE is delivered as a statically compiled X64/AMD64 binary that is compatible with all major Linux distributions. Running Kodex-EE on Windows, macOS and OpenBSD should be possible but is currently untested and unsupported.
Memory
We recommend at least 512 MB of available RAM for both the API and the daemon. Actual memory usage can vary depending on the workloads, but Kodex-EE will try to automatically limit data processing in order to not exhaust all available memory.
CPU
Both the daemon and the API process data concurrently in parallel threads, hence they can effectively utilize multiple CPU cores / processing threads.
Storage
Apart from the task-specific storage mentioned above, neither the daemon nor the API requires persistent on-disk data storage.
Networking
Kodex-EE needs to have connectivity to data sources and destinations from which it should read or to which it should write data. Processing speed will be limited by available network bandwidth.
Operations
The following sections describe operations-related aspects of Kodex-EE.
Metrics
Kodex-EE collects various metrics that are exposed via the REST API and supports instrumentation for Prometheus.
Logging
Kodex-EE has an internal, level-based logging system that is compatible with syslog.
Command Line Interface (CLI)
The Kodex command line interface (CLI) is the easiest way to get started with privacy and security engineering. It enables you to run a wide range of privacy- & security-enhancing transformations as well as analyses on your structured data.
Installing Kodex CLI
First, you need to download or build the Kodex CLI tool. You can download pre-built binaries for various platforms on our website. Alternatively, you can build the tool from source by following the instructions on our GitHub page.
Getting started
By default, you control the CLI tool using so-called blueprints. A blueprint is a config file (or a collection of them) that describes how Kodex should read, analyze, transform and write structured data.
To run a blueprint, simply execute kodex run [blueprint name]. Kodex comes with a free, public repository of example blueprints that help you get started. You can download and install them via the command line as well:
kodex blueprints download
This will download our public blueprints repository and store it in a local directory (by default ~/.kodex/blueprints). You can then run any blueprint by simply specifying its path relative to the blueprints directory. So let’s run a simple example that shows how Kodex can pseudonymize different data types:
kodex run pseudonymization/examples/data-types/pseudonymize
This will load the configuration from the blueprint file (pseudonymize.yml). This file specifies where data should be read from (a JSON file in this case), how the data should be transformed (using pseudonymization in this case) and where the resulting output data should be sent (a JSON file again).
The items in the input.json file that we want to pseudonymize.
{
  "name": "test",
  "date": "2020-06-04",
  "ip": "42.34.122.112/32",
  "count": 4354
}
{
  "name": "another test",
  "date": "2019-07-02",
  "ip": "42.34.122.114/32",
  "count": 214
}
The example blueprint that we picked will read data items from an input.json file located in the same directory as the blueprint, pseudonymize all attributes of each item using different applicable pseudonymization methods, and write the pseudonymized data to a JSON file (pseudonymized.json) in the current directory. Here’s what the output looks like:
{
  "_kip": "278ba5f7db26ca661b4e64b1eb6abb3d4e7d1aa15e55155b3a1f7626424f679c",
  "count": 7643,
  "date": "2021-09-11",
  "ip": "150.39.196.226/32",
  "name": "3YdcJQ=="
}
{
  "_kip": "278ba5f7db26ca661b4e64b1eb6abb3d4e7d1aa15e55155b3a1f7626424f679c",
  "count": 46,
  "date": "2015-07-19",
  "ip": "150.39.196.8/32",
  "name": "gKGhSKShnHg/vny6"
}
As you can see, Kodex pseudonymized every attribute in every data item and also added a new attribute, _kip, to the items. The value of that attribute refers to a parameter set that contains the cryptographic keys that were used to transform the data. The actual keys are stored in a so-called parameter store. If you don’t want that, you can also manage keys yourself: The pseudonymize-with-key blueprint in the same directory does that by first asking you to enter a key and then using that key to derive further encryption keys for the individual pseudonymization operations. Key & parameter management is a complex topic in itself; for now, just rest assured that Kodex takes care of the messy details for you.
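The idea of deriving several operation-specific keys from one user-supplied key can be sketched in Python with HMAC-SHA256. This is a generic technique shown under stated assumptions: the function name, the purpose labels and the choice of HMAC are illustrative, and the derivation scheme that Kodex actually uses is not described here.

```python
import hashlib
import hmac

def derive_key(master_key: bytes, purpose: str) -> bytes:
    """Derive a purpose-specific 32-byte key from a master key (HMAC-SHA256)."""
    return hmac.new(master_key, purpose.encode("utf-8"), hashlib.sha256).digest()

master = b"example master key entered by the user"  # placeholder value
name_key = derive_key(master, "name")  # hypothetical purpose labels
ip_key = derive_key(master, "ip")
```

Because the derivation is deterministic, entering the same master key later reproduces exactly the same derived keys, which is what makes a later reversal of the transformation possible without storing the derived keys.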
Depseudonymizing data
At some point you might actually want to depseudonymize your data again. Kodex makes this easy by providing an undo action that can be applied to reversible transformations like the cryptographic pseudonymization above. So, to depseudonymize the data above, we can simply run a blueprint that contains such an undo action:
kodex run pseudonymization/examples/data-types/depseudonymize
which will print the depseudonymized data (which should exactly match the input data). If you provided a pseudonymization key yourself by using the pseudonymize-with-key blueprint, you can run the depseudonymize-with-key blueprint instead, which will ask you to enter the pseudonymization key. Isn’t that easy?
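To illustrate what makes a keyed transformation reversible, here is a toy Python sketch that maps 32-bit integers (for example IPv4 addresses) to other 32-bit integers with a small Feistel network and inverts the mapping again. It is not one of the pseudonymization methods implemented by Kodex and has not been vetted cryptographically; the function names and the round count are assumptions chosen for the example.

```python
import hashlib
import hmac

ROUNDS = 4  # illustrative choice, not a security recommendation

def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: 16 output bits taken from HMAC-SHA256."""
    msg = bytes([round_no]) + half.to_bytes(2, "big")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:2], "big")

def pseudonymize(key: bytes, value: int) -> int:
    """Map a 32-bit integer to another 32-bit integer (balanced Feistel network)."""
    left, right = value >> 16, value & 0xFFFF
    for i in range(ROUNDS):
        left, right = right, left ^ _round(key, i, right)
    return (left << 16) | right

def undo(key: bytes, value: int) -> int:
    """Invert pseudonymize() by running the rounds in reverse order."""
    left, right = value >> 16, value & 0xFFFF
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round(key, i, left), left
    return (left << 16) | right

key = b"example key"
original = 0x2A227A70  # the IPv4 address 42.34.122.112 as an integer
restored = undo(key, pseudonymize(key, original))
```

The output stays within the same 32-bit range as the input, and anyone holding the key can recover the original value, whereas a hash-based pseudonym could not be undone this way.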
Actions
Actions are how Kodex transforms or analyzes data. Actions can be called individually or as a sequence, in which each action will receive the result of the previous action. The following sections describe the different action types that Kodex currently supports.
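Conceptually, a sequence of actions is a chain of functions in which each one receives the output of its predecessor. The Python sketch below shows this with two made-up actions; the function names and the item layout are assumptions for illustration and are unrelated to the internal action interface of Kodex.

```python
from functools import reduce
from typing import Callable, Dict, List

Item = Dict[str, object]
Action = Callable[[Item], Item]

def run_actions(actions: List[Action], item: Item) -> Item:
    """Apply actions in order; each receives the previous action's result."""
    return reduce(lambda current, action: action(current), actions, item)

def mask_name(item: Item) -> Item:
    """Hypothetical action: replace the name with a fixed mask."""
    return {**item, "name": "***"}

def drop_ip(item: Item) -> Item:
    """Hypothetical action: remove the ip attribute."""
    return {k: v for k, v in item.items() if k != "ip"}

result = run_actions(
    [mask_name, drop_ip],
    {"name": "test", "ip": "42.34.122.112/32", "count": 4354},
)
```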
Generic Validation & Transformation
Often data is complex & hierarchical, making it difficult to apply simple transformations to it. For this case, Kodex supports generic form-based validation and transformation of data. This enables you to easily parse, validate and transform complex, hierarchical data like JSON documents. You can read more about this type of transformation in the form action documentation.
Pseudonymization
Pseudonymization produces data that is no longer directly attributable to a specific individual. Using pseudonymous data lowers the risk of data processing for individuals and reduces the impact of data loss or data theft. It can be applied to direct or indirect identifiers as well as to a wide range of structured data types like numbers, dates, names or IP addresses. Some pseudonymization methods are based on reversible encryption, which makes it possible to de-pseudonymize the data again given knowledge of the key. Other methods like hashing are non-reversible.
Encryption
Encryption produces data that is statistically indistinguishable from random noise and that can be decrypted only with knowledge of a secret encryption key. Kodex implements standard symmetric and asymmetric encryption techniques.
Please note that we consider encryption methods which produce the same ciphertext when given identical input data and encryption key(s) to be pseudonymization methods, as the resulting data does (intentionally) not conform to modern security standards for encryption methods. Therefore, format-preserving encryption methods that operate without random initialization vectors (IVs) are also considered pseudonymization methods under this approach.
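The distinction between deterministic and randomized encryption can be made concrete with a toy stream cipher in Python. The cipher below (HMAC-SHA256 in counter mode, XORed onto the data) is written only for this illustration, is not taken from Kodex, and should not be used to protect real data; all names are assumptions.

```python
import hashlib
import hmac
import os

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 in counter mode (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def transform(key: bytes, data: bytes, nonce: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))

key = b"k" * 32
data = b"another test"

# Deterministic: a fixed (empty) nonce, so equal inputs give equal outputs.
det_1 = transform(key, data, b"")
det_2 = transform(key, data, b"")

# Randomized: a fresh random nonce per call, stored alongside the ciphertext.
nonce_1, nonce_2 = os.urandom(16), os.urandom(16)
rnd_1 = transform(key, data, nonce_1)
rnd_2 = transform(key, data, nonce_2)
```

The deterministic outputs are identical, so equal values remain linkable across records, which is the property that makes such a method a pseudonymization method in the sense used above. The randomized outputs differ from call to call and reveal nothing about whether two plaintexts were equal.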
Anonymization
Anonymization actions use statistical techniques to produce data from which it should not be possible to identify any specific individuals or infer any non-trivial information about these individuals. Kodex relies on randomization and aggregation as anonymity mechanisms and produces anonymous data that conforms to modern anonymity standards like “differential privacy”.
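As a minimal sketch of how randomization and aggregation combine under differential privacy, the following Python example releases a count with Laplace noise (the standard Laplace mechanism for a query with sensitivity 1). It is a textbook construction shown for orientation, not the anonymization code of Kodex, and the function names and parameter values are assumptions.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count under epsilon-differential privacy (sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random()
print(noisy_count(4354, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger privacy. Production implementations also have to deal with subtleties such as floating-point artifacts and secure randomness, which this sketch ignores.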
Discovery
Discovery actions enable you to detect different types of personal or sensitive information in your structured and unstructured data items.
Identification & Identity Management
Identification & identity management actions enable you to identify individuals based on various direct or indirect identifiers, and to associate a permanent pseudonymous ID with every individual. This allows you to e.g. attribute data items from different sources to a single individual and use the pseudonymous ID of that individual to e.g. produce anonymous or pseudonymous data or verify the consent of that individual for a given data processing purpose.
Consent Management
Consent management actions enable you to verify that a given individual has given consent for a specific data processing purpose. Together with the identification & identity management actions, they enable you to build compliant data processing workflows and e.g. suppress data from individuals that have not given or have withdrawn their consent for a specific processing purpose.
Audit Logging
Audit logging actions enable you to trace the flow of data from individuals through your entire data processing infrastructure. They produce pseudonymous, searchable information about how data belonging to a specific individual was processed and where it was sent to, allowing you to e.g. retrieve, amend or delete this data.
Form Action
The form action implements our open-source form validation & transformation library, enabling you to parse, validate and transform complex, hierarchical data.
Getting Started
A form consists of a number of fields. Each field can define one or more validators, which can validate and transform the content of the field. If the form validation is successful, the action will return the validated fields. If not, an error will be thrown. Here’s an example form action specification:
type: form
config:
  fields:
    - name: date
      validators:
        - type: IsString
        - type: IsTime
          config:
            format: rfc3339
    - name: count
      validators:
        - type: IsInteger
          config:
            hasMin: true
            min: 0
            hasMax: true
            max: 1000
This action will ensure each item has a date field that contains an RFC 3339 formatted date string and a count field that contains an integer value between 0 and 1000. There are many other validator types available. Here’s the full list:
- CanBeAnything: Just ensures the field is present, but does not perform any other validation on it.
- IsBoolean: Ensures the field contains a boolean value, i.e. true or false.
- IsBytes: Ensures the field contains a byte array with a given encoding, e.g. hex, base64 or base64-url encoded data.
- IsFloat: Ensures the field contains a float value. Optionally, if hasMin or hasMax is true, the min or max config options define the range of the float value.
- IsHex: Ensures the field contains a hex value. If convert is true, the value will be converted to a byte array.
- IsIn: Ensures the field contains one of the specified choices.
- IsInteger: Ensures the field contains an integer value. Optionally, if hasMin or hasMax is true, the min or max config options define the range of the integer value.
- IsList: Ensures the field contains a list of values. Optionally, a list of validators will be applied to each list element.
- IsNotIn: Like IsIn, but ensures the value of the field is not in the given choices.
- IsOptional: Will skip the other validators if the field is undefined, making it optional.
- IsString: Ensures the field contains a string value. Optionally, minLength and maxLength specify the minimum and maximum length of the string.
- IsStringList: Ensures the field contains a list of strings. Optionally, a list of validators will be applied to each list element.
- IsStringMap: Ensures the field contains a string-based hash map / dictionary. Optionally, a form can be specified that will be used to validate the value.
- IsTime: Ensures the field contains a datetime/time object in the given format, either rfc3339, rfc3339-date, unix, unix-nano or unix-milli. If raw is true, the string will not be converted to a Time object. If toUTC is true, the resulting Time object will be converted to UTC.
- IsUUID: Ensures the field contains a UUID value.
- MatchesRegex: Ensures the field matches the given regular expression. Assumes that the field contains a string value, which should be validated beforehand using the IsString validator.
- Or: Ensures the field contains one of several possible values as specified by a list of options, which in turn contain a list of validators. Will stop at the first option that successfully validates the value.
- Switch: Validates the field against a list of validators defined in a cases string map, where the chosen case depends on the value of another field whose name is given by key.
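The behaviour of a form (a set of fields, each with a chain of validators that may also transform the value) can be imitated in a few lines of Python. The sketch mirrors the date/count example from the form specification shown earlier, but it is a simplified stand-in rather than the form library itself: the function names are invented, and returning only the declared fields is an assumption.

```python
from datetime import datetime

def is_string(value):
    if not isinstance(value, str):
        raise ValueError("expected a string")
    return value

def is_time_rfc3339(value):
    # fromisoformat() accepts a subset of RFC 3339; "Z" is mapped to "+00:00"
    # because older Python versions do not parse the "Z" suffix.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def is_integer(minimum, maximum):
    def validate(value):
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("expected an integer")
        if not minimum <= value <= maximum:
            raise ValueError(f"expected a value between {minimum} and {maximum}")
        return value
    return validate

FORM = {
    "date": [is_string, is_time_rfc3339],
    "count": [is_integer(0, 1000)],
}

def validate_item(form, item):
    """Run each field's validators in order and return the validated fields."""
    result = {}
    for field, validators in form.items():
        if field not in item:
            raise ValueError(f"missing field: {field}")
        value = item[field]
        for validator in validators:
            value = validator(value)  # validators may transform the value
        result[field] = value
    return result

validated = validate_item(FORM, {"date": "2020-06-04T12:00:00Z", "count": 214, "extra": 1})
```

Note how the second validator for date turns the string into a datetime object, in the same way that a validator such as IsTime both checks and converts its input.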
Actions Within Forms
In addition, the form mechanism supports all other actions as well through the IsAction validator, which takes an action specification as configuration. Example validator:
type: IsAction
config:
  type: pseudonymize
  config:
    method: merengue
This performs a pseudonymization of the value using the merengue method. Parameters of embedded actions are managed automatically by the form action.
Blueprints
Blueprints are configurations that specify how Kodex should read, analyze, transform and write data.
Privacy & security engineering is complex: In a typical scenario, we might want to read data from a variety of sources (e.g. a database, file or API) and process it for different purposes (e.g. analytics or anomaly detection), which often require different transformations. We also might want to send the data to a variety of destinations (e.g. a database or message queue). Finally, we might want to do all this continuously as data is produced.
To make this as easy as possible, we implemented a declarative and expressive config language based on YAML files. Other systems like Kubernetes or Ansible have shown that collections of YAML files can be powerful tools for configuration management. Also, files are easy to work with and can be versioned just like any other configuration file.
Blueprint basics
The basic structure of a blueprint is shown below. This example would read data from a SQL database, pseudonymize it and forward the pseudonymized data to an HTTP API.
actions: # different actions we want to apply to our data items
  - name: pseudonymize-name
    type: pseudonymize
    …
sources: # sources we want to load data items from
  - name: production-db
    type: sql
    …
destinations: # destinations we want to send data items to
  - name: audit-api
    type: http
    …
streams: # streams of data items that we want to process
  - name: default
    description: |
      Pseudonymize incoming data before sending it to the audit API.
    sources: # sources of data items for this stream
      - production-db
    configs: # configurations for this stream
      - name: default
        actions: # actions in this configuration
          - pseudonymize-name
        destinations: # destinations that these items should be sent to
          - name: audit-api
A blueprint essentially contains at least four different sections: Actions, sources, destinations and streams. Each stream refers to one or more sources and contains one or more stream configs. Each stream config refers to one or more actions as well as one or more destinations. Together these primitives tell Kodex how data should flow through it and how it should be analyzed and transformed.
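Because streams and stream configs refer to actions, sources and destinations by name, a blueprint is only consistent if every referenced name is defined. The Python sketch below checks this on a blueprint that has already been loaded into a dictionary. It is a hypothetical helper written for illustration (the function name and the dictionary layout are assumptions based on the example above); it says nothing about how Kodex itself validates blueprints.

```python
def check_references(blueprint: dict) -> list:
    """Return (section, name) pairs that streams reference but that are undefined."""
    defined = {
        section: {entry["name"] for entry in blueprint.get(section, [])}
        for section in ("actions", "sources", "destinations")
    }
    missing = []
    for stream in blueprint.get("streams", []):
        for source in stream.get("sources", []):
            if source not in defined["sources"]:
                missing.append(("sources", source))
        for config in stream.get("configs", []):
            for action in config.get("actions", []):
                if action not in defined["actions"]:
                    missing.append(("actions", action))
            for destination in config.get("destinations", []):
                if destination["name"] not in defined["destinations"]:
                    missing.append(("destinations", destination["name"]))
    return missing

# The example blueprint from above, with the elided settings left out.
blueprint = {
    "actions": [{"name": "pseudonymize-name", "type": "pseudonymize"}],
    "sources": [{"name": "production-db", "type": "sql"}],
    "destinations": [{"name": "audit-api", "type": "http"}],
    "streams": [{
        "name": "default",
        "sources": ["production-db"],
        "configs": [{
            "name": "default",
            "actions": ["pseudonymize-name"],
            "destinations": [{"name": "audit-api"}],
        }],
    }],
}

problems = check_references(blueprint)  # empty when all references resolve
```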
Running blueprints
You can run a given blueprint using the run command:
kodex run [blueprint name]
If you omit the [blueprint name], Kodex will look for a file named .blueprint.yml in the current directory to run. Otherwise, Kodex will first check whether you entered a file location and, if so, load the blueprint from there. If not, Kodex will go through all of its blueprint directories (by default ~/.kodex/blueprints) and try to find the blueprint you specified. For example, if we run
kodex run pseudonymization/examples/data-types/pseudonymize
Kodex will try to find a blueprint named pseudonymize.yml in a subfolder pseudonymization/examples/data-types within any of the given blueprint paths. Sometimes you might have different versions of a blueprint installed. By default, Kodex will load the latest version it can find. If you don’t want that, you can specify a version using the --version flag:
kodex run my-blueprint --version 0.4.1
Version numbers follow the semantic version specification (2.0).
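The lookup order and the "latest version" rule described in this section can be sketched in Python as follows. This is a simplified model under stated assumptions: the function names are invented, versions are compared as plain MAJOR.MINOR.PATCH triples without pre-release handling, and how Kodex actually stores and discovers multiple versions on disk is not covered.

```python
from pathlib import Path
from typing import Iterable, List, Optional

DEFAULT_NAME = ".blueprint.yml"

def resolve_blueprint(name: Optional[str], blueprint_dirs: Iterable[Path], cwd: Path) -> Optional[Path]:
    """Mimic the lookup order described above (simplified)."""
    if not name:
        # No name given: fall back to the default file in the current directory.
        candidate = cwd / DEFAULT_NAME
        return candidate if candidate.is_file() else None
    direct = Path(name)
    if not direct.is_absolute():
        direct = cwd / direct
    if direct.is_file():
        # The name is a file location: load the blueprint from there.
        return direct
    for directory in blueprint_dirs:
        # Otherwise search every blueprint directory for <name>.yml.
        candidate = directory / f"{name}.yml"
        if candidate.is_file():
            return candidate
    return None

def latest_version(versions: List[str]) -> str:
    """Pick the highest MAJOR.MINOR.PATCH version (no pre-release handling)."""
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))
```

Comparing versions as integer triples rather than as strings matters: as strings, "0.9.3" would sort after "0.10.0", whereas semantic versioning treats 0.10.0 as the newer release.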
Blueprint repositories
Kodex lets you download and install blueprints from so-called repositories. For example, to get our official blueprints repository, simply run
kodex blueprints download
This will download the latest snapshot of the master branch of our Blueprints repository and install it into a local directory. If you want to download blueprints from a different URL, simply provide it:
kodex blueprints download https://my.blueprints/repo.zip
Kodex will look for a directory with a .blueprints.yml file in the ZIP archive and extract that directory to the main blueprints path. You can also create your own local blueprint repositories, of course. Just make sure to put them in a subfolder of a blueprints path and create a .blueprints.yml file with the following content:
package: [your-package-name] # e.g. my-blueprints
version: [your version] # e.g. 1.5.2
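Locating the repository directory inside a downloaded ZIP archive, as described above, can be sketched with the Python standard library. The function name is an assumption and the sketch only finds the directory that holds the .blueprints.yml marker file; extraction and the installation logic of Kodex are left out.

```python
import zipfile
from pathlib import PurePosixPath
from typing import Optional

MARKER = ".blueprints.yml"

def find_repository_root(archive: zipfile.ZipFile) -> Optional[str]:
    """Return the archive directory that contains the marker file, if any."""
    for member in archive.namelist():
        path = PurePosixPath(member)  # ZIP member names always use "/"
        if path.name == MARKER:
            return str(path.parent)
    return None
```

For an archive whose marker file sits at the top level, the function returns ".", and it returns None if the archive contains no marker file at all.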