Deep storage configuration

Deep Storage is where Druid stores data segments. For a Kubernetes environment, either the HDFS or S3 backend is recommended.

HDFS

Druid can use HDFS as a deep storage backend, which requires a running HDFS instance. You can use the Stackable Operator for Apache HDFS to run one. Configure the HDFS deep storage backend in your Druid cluster as follows:

spec:
  clusterConfig:
    deepStorage:
      hdfs:
        configMapName: simple-hdfs (1)
        directory: /druid (2)
...
1 Name of the HDFS cluster discovery ConfigMap. It must contain the core-site.xml and hdfs-site.xml files and can be supplied manually for an HDFS cluster that is not managed by Stackable.
2 The HDFS directory in which Druid stores its data.
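For illustration, a manually supplied discovery ConfigMap might look roughly like the sketch below; the fs.defaultFS value is only a placeholder assumption here, and a Stackable-managed HDFS cluster creates this ConfigMap automatically.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: simple-hdfs
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://simple-hdfs-namenode:8020</value>
      </property>
    </configuration>
  hdfs-site.xml: |
    <configuration>
    </configuration>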

S3

Druid can use S3 as a backend for deep storage:

spec:
  clusterConfig:
    deepStorage:
      s3:
        bucket:
          inline:
            bucketName: my-bucket  (1)
            connection:
              inline:
                host: test-minio  (2)
                port: 9000  (3)
                credentials:  (4)
                ...
1 Bucket name.
2 Bucket host.
3 Optional bucket port.
4 Credentials explained below.

It is also possible to configure the bucket connection details as a separate Kubernetes resource and only refer to that object from the DruidCluster like this:

spec:
  clusterConfig:
    deepStorage:
      s3:
        bucket:
          reference: my-bucket-resource (1)
1 Name of the bucket resource with connection details.

The resource named my-bucket-resource is then defined as shown below:

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-bucket-name
  connection:
    inline:
      host: test-minio
      port: 9000
      credentials:
        ... (explained below)

This has the advantage that the bucket configuration can be shared across DruidClusters (and other Stackable CRDs) and only needs to be updated in one place.

You can specify a connection/bucket for ingestion only, for deep storage only, or for both, but under the hood Druid supports only a single S3 connection. If two connections are specified, they must be identical. The easiest way to ensure this is to use a dedicated S3Connection resource rather than defining the connection inline, as sketched below.
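
A minimal sketch of this approach, assuming a dedicated S3Connection resource named my-connection-resource (a placeholder name): the connection details are defined once and referenced from the bucket definition.

---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Connection
metadata:
  name: my-connection-resource
spec:
  host: test-minio
  port: 9000
  credentials:
    secretClass: s3-credentials-class
---
apiVersion: s3.stackable.tech/v1alpha1
kind: S3Bucket
metadata:
  name: my-bucket-resource
spec:
  bucketName: my-bucket-name
  connection:
    reference: my-connection-resource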

TLS for S3 is not yet supported.

S3 Credentials

Regardless of whether the connection is specified inline or as a separate object, the credentials are always specified in the same way: you need a Secret containing the access key ID and secret access key, a SecretClass, and a reference to this SecretClass wherever you want to specify the credentials.

The Secret:

apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  labels:
    secrets.stackable.tech/class: s3-credentials-class  (1)
stringData:
  accessKey: YOUR_VALID_ACCESS_KEY_ID_HERE
  secretKey: YOUR_SECRET_ACCESS_KEY_THAT_BELONGS_TO_THE_KEY_ID_HERE
1 This label connects the Secret to the SecretClass.

The SecretClass, whose k8sSearch backend searches for the Secret in the namespace of the Pod that uses it:

apiVersion: secrets.stackable.tech/v1alpha1
kind: SecretClass
metadata:
  name: s3-credentials-class
spec:
  backend:
    k8sSearch:
      searchNamespace:
        pod: {}

Referencing it:

...
credentials:
  secretClass: s3-credentials-class
...
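
Putting the pieces together, a deep storage definition with an inline connection and the SecretClass reference then looks like this (host, port and names are taken from the examples above):

spec:
  clusterConfig:
    deepStorage:
      s3:
        bucket:
          inline:
            bucketName: my-bucket
            connection:
              inline:
                host: test-minio
                port: 9000
                credentials:
                  secretClass: s3-credentials-class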