Apache Pulsar Tiered Storage with Storj DCS

May 20, 2021

Apache Pulsar is an open-source, cloud-native distributed tiered messaging system that is a part of the Apache Software Foundation. This distributed messaging and streaming platform manages hundreds of billions of events per day and is widely deployed across enterprise-grade systems, including Splunk, Overstock.com, Verizon, Comcast, and Toast. 

Pulsar's Tiered Storage feature lets developers offload older, non-compacted data to cheaper storage, making it economical to keep for long periods of time.

Today, we would like to showcase an open-source integration and walk through the process of offloading event data from Pulsar to Storj DCS (Decentralized Cloud Storage).

"Pulsar has reshaped the way that the industry thinks about a modern messaging and event streaming architecture," said Chris Latimer, VP of Product, Streaming at Datastax. "At the same time Storj's Decentralized Cloud Storage has pushed the boundaries of distributed persistence. When you combine these technologies you end up with a highly efficient solution both in terms of cost and performance."

Tiered storage enables a more efficient messaging stack

Event sourcing architectures commonly have developers keep messages forever, resulting in costly storage on NVMe disks.

Tiered Storage in Apache Pulsar solves this problem, making it easier to reduce the total cost of data ownership in messaging systems while still guaranteeing the integrity and availability of the data.

High-performance delivery requires expensive disks. As messages age, performance matters less, and they can be offloaded to cheaper cloud storage.

In Apache Pulsar, the managed ledger (backed by Apache BookKeeper) packs messages into an ordered list of segments. Every segment except the one currently being written to can be offloaded (in this case, to the decentralized cloud). A namespace policy can automate when this offload is triggered; for example, the following command triggers offload once a topic in the namespace holds more than 10MB:

$ bin/pulsar-admin namespaces set-offload-threshold --size 10M my-tenant/my-namespace
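While testing, it can help to read the policy back and trigger an offload by hand instead of waiting for the threshold. The following is a minimal sketch using the pulsar-admin subcommands get-offload-threshold, topics offload, and topics offload-status. The topic name my-topic is hypothetical, and the script only prints the commands unless DRY_RUN=0 is set, since it assumes nothing about your cluster.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to run against a real cluster from the Pulsar install directory.
pulsar_admin() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "bin/pulsar-admin $*"
  else
    bin/pulsar-admin "$@"
  fi
}

NAMESPACE="my-tenant/my-namespace"
TOPIC="$NAMESPACE/my-topic"   # hypothetical topic name, for illustration only

# Read back the threshold set by the namespace policy.
pulsar_admin namespaces get-offload-threshold "$NAMESPACE"

# Manually offload, keeping at most 10M of the topic's data on BookKeeper.
pulsar_admin topics offload --size-threshold 10M "$TOPIC"

# Wait for the offload to finish and report its status.
pulsar_admin topics offload-status -w "$TOPIC"
```

Reviewing the printed commands first is a deliberate choice: an offload moves data, so it is worth confirming the namespace and topic before running it for real.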

The default Pulsar MaxBlockSize of 64MB exactly matches the ideal ingest block size of the Storj DCS network. To reduce the number of orders on the decentralized network, a ReadBufferSize of 64MB is ideal, although the default Pulsar setting of 1MB is also supported.
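Both sizes are configured in bytes, so the 64MB figure needs converting. The sketch below does the arithmetic and prints the two lines for conf/broker.conf. The property names are an assumption taken from the Pulsar 2.x S3 offloader documentation; newer releases may use different names, so check them against your version.

```shell
#!/bin/sh
# 64MB expressed in bytes (64 * 1024 * 1024), matching Pulsar's default block size.
MIB=$((1024 * 1024))
MAX_BLOCK_BYTES=$((64 * MIB))
READ_BUFFER_BYTES=$((64 * MIB))   # raised from Pulsar's 1MB default

# Print the lines to review and merge into conf/broker.conf.
# Property names are assumed from the Pulsar 2.x S3 offloader documentation.
echo "s3ManagedLedgerOffloadMaxBlockSizeInBytes=$MAX_BLOCK_BYTES"
echo "s3ManagedLedgerOffloadReadBufferSizeInBytes=$READ_BUFFER_BYTES"
```

Both values come to 67108864 bytes. Only the read buffer line changes the default behavior; the block size line simply makes the default explicit.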

Pulsar – Message Replay with Storj DCS

The ability to replay messages is critical when working with producers that may not be able to resend messages or may be of unknown reliability. Replay capability adds flexibility, allowing you to test, recover, or repair without relying on producers. New applications or algorithms requiring historical data can be quickly synced to the current state.

Regardless of the driver, replay capability is a valuable addition for Pulsar messaging services. 
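One way to replay is to rewind a subscription's cursor, after which the broker redelivers older messages, reading offloaded segments back from tiered storage as needed. The sketch below uses the pulsar-admin topics reset-cursor subcommand. The topic and subscription names are hypothetical, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: the command is printed instead of executed.
# Set DRY_RUN=0 to run against a real cluster from the Pulsar install directory.
pulsar_admin() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "bin/pulsar-admin $*"
  else
    bin/pulsar-admin "$@"
  fi
}

TOPIC="persistent://my-tenant/my-namespace/my-topic"   # hypothetical topic
SUBSCRIPTION="my-subscription"                          # hypothetical subscription

# Rewind the subscription by three days so that consumers on it
# receive the messages published since then again.
pulsar_admin topics reset-cursor --subscription "$SUBSCRIPTION" --time 3d "$TOPIC"
```

Resetting the cursor affects every consumer on that subscription, so a dedicated subscription is a reasonable choice for backfilling a new application.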

Getting Started: Distributed Messaging and The Distributed Cloud

DataStax has built an integration that enables Pulsar users to offload their messages to Storj DCS for tiered storage.

The Helm chart installation and deployment guide is located in the DataStax GitHub repo, here: https://github.com/datastax/pulsar-helm-chart/blob/master/helm-chart-sources/pulsar/values.yaml

Tiered storage can be configured in the storageOffload section of the values.yaml file. The chart has explicit support for Storj DCS, a provider of secure, decentralized storage. You can enable the Storj DCS S3 gateway in the extras configuration; instructions for configuring the gateway are provided in the Storj DCS section of the values.yaml file.



If you’d like to try the integration without running any infrastructure locally, you can use the multi-tenant Storj gateway, Gateway MT.

Storj S3 Gateway Driver configuration

  1. Create and log in to your free account on Storj DCS.
  2. Go to “Objects” and create your first S3 bucket.
  3. On your Dashboard, create your access grants (if you did not create an access grant during the initial setup wizard).

 


  4. While creating your credentials, select “Generate S3 Gateway Credentials” in the last step.

 



S3 Configuration for Apache Pulsar

We will use the credentials generated in the previous step (in the Storj Gateway MT console) to configure tiered storage offloading. Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/pulsar_env.sh.

export AWS_ACCESS_KEY_ID=ABC123456789
export AWS_SECRET_ACCESS_KEY=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c

"export" is important so that the variables are made available in the environment of spawned processes.

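The effect of export can be seen with a small experiment. The sketch below sets one variable with export and one without, then asks a child shell which of them it can see. The variable names are hypothetical and chosen so they do not clash with real credentials.

```shell
#!/bin/sh
# Hypothetical variable names, chosen so they do not clash with real credentials.
DEMO_PLAIN="set-without-export"
export DEMO_EXPORTED="set-with-export"

# Ask a child shell what it can see; single quotes defer expansion to the child.
plain_in_child=$(sh -c 'echo "${DEMO_PLAIN:-<not visible>}"')
exported_in_child=$(sh -c 'echo "${DEMO_EXPORTED:-<not visible>}"')

echo "DEMO_PLAIN in child:    $plain_in_child"
echo "DEMO_EXPORTED in child: $exported_in_child"
```

Only the exported variable reaches the child process, which is the same situation the broker's Java process is in when it is launched by the Pulsar scripts.
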
Alternatively, add the Java system properties aws.accessKeyId and aws.secretKey to PULSAR_EXTRA_OPTS in conf/pulsar_env.sh:
PULSAR_EXTRA_OPTS="${PULSAR_EXTRA_OPTS} ${PULSAR_MEM} ${PULSAR_GC}
-Daws.accessKeyId=ABC123456789
-Daws.secretKey=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c
-Dio.netty.leakDetectionLevel=disabled
-Dio.netty.recycler.maxCapacity.default=1000
-Dio.netty.recycler.linkCapacity=1024"
A third option is to set the access credentials in ~/.aws/credentials, using the credentials generated for the Storj DCS gateway:
[default]
aws_access_key_id=ABC123456789
aws_secret_access_key=ded7db27a4558e2ea8bbf0bf37ae0e8521618f366c

Whichever method you choose, Pulsar resolves the credentials through the AWS SDK's "DefaultAWSCredentialsProviderChain".

  • The broker must be restarted for credentials specified in pulsar_env.sh to take effect.
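Credentials alone do not tell the broker where to offload, so the S3 offloader also needs a driver, a bucket, and an endpoint in conf/broker.conf. The sketch below writes those settings to a scratch file for review. The property names are an assumption taken from the Pulsar 2.x tiered storage documentation, the bucket name is hypothetical, and the endpoint is an assumed Gateway MT address that should be confirmed against the value shown with your S3 credentials.

```shell
#!/bin/sh
BUCKET="my-pulsar-offload"                  # hypothetical bucket name
ENDPOINT="https://gateway.storjshare.io"    # assumed Gateway MT endpoint; confirm in the Storj console

# Write the fragment to a scratch file for review rather than
# editing conf/broker.conf in place.
FRAGMENT=$(mktemp)
cat > "$FRAGMENT" <<EOF
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=$BUCKET
s3ManagedLedgerOffloadServiceEndpoint=$ENDPOINT
EOF

# Show the result so it can be merged into conf/broker.conf by hand.
cat "$FRAGMENT"
```

The aws-s3 driver is used here because Gateway MT speaks the S3 protocol; the endpoint setting is what points the driver at Storj DCS.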

Why Pulsar?

Put simply, Pulsar’s tiered storage model and easy message replay make it a great tool that plays well with Storj DCS.

Data ingestion and messaging are the starting point for modern data applications. As data and machine learning continue to grow in importance, companies must make sure they have the right messaging and storage systems in place.

Apache Pulsar was designed with a multi-layer architecture in which each layer is scalable, distributed, and decoupled from the other layers. With Pulsar, you can add new topics as needed and seamlessly scale performance. 

With Storj DCS, Pulsar developers gain a more economical solution stack that is better optimized for performant delivery at the edge.

Conclusion

Many companies and technologists have begun to use Apache Pulsar for pub/sub messaging and to integrate it into their application builds.

The Thoughtworks Technology Radar has written about Pulsar, stating: "We're also looking to Pulsar to solve the problem of a never-ending log of messages for our large-scale data systems where events are expected to persist indefinitely, and subscribers can start consuming messages retrospectively. This is supported through a tiered storage model."

---------------------------------
We look forward to gathering feedback from the Storj DCS Community around this integration. If you are interested in integrating Storj DCS and Pulsar into your application stack, please reach out to us directly: partnerships@storj.io.
