Using s3fs-supported pandas API
Pandas is one of the most widely used Python libraries for data analysis and data science. Storj DCS lets you store datasets with a high level of durability, privacy, and security.
In this post, we will look at how to use pandas to save data to and load data from Storj DCS.
- Configuring pandas
- Saving data to Storj DCS
- Loading data from Storj DCS
We rely on a pandas feature called storage_options. This extra parameter lets us pass connection settings through to a specific storage backend. It was introduced in version 1.2.0; further details are in the pandas documentation for storage_options.
From the pandas 0.20.1 documentation:
“pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas.”
Installing pandas Version 1.2.0
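Both packages are usually installed with pip (for example, pip install "pandas>=1.2.0" s3fs). As a minimal sketch, the helper below checks from Python that an environment meets the requirements described above. The names version_tuple and check_requirements are illustrative and are not part of pandas or s3fs.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '1.2.0' into (1, 2, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as '1.2' so comparisons stay consistent.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(minimum_pandas="1.2.0"):
    """Return a list of problems; an empty list means the setup looks ready."""
    problems = []
    try:
        installed = version("pandas")
        if version_tuple(installed) < version_tuple(minimum_pandas):
            problems.append(f"pandas {installed} is older than {minimum_pandas}")
    except PackageNotFoundError:
        problems.append("pandas is not installed")
    try:
        version("s3fs")  # s3fs is optional for pandas, so check it separately
    except PackageNotFoundError:
        problems.append("s3fs is not installed")
    return problems
```

Calling check_requirements() before running the rest of the examples gives a readable message instead of an import error deep inside pandas.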
If you already have a Storj DCS account, you just need to get your keys and endpoint URL.
We are going to load the credentials from environment variables. You should have these three variables available: ACCESS_KEY_ID, SECRET_ACCESS_KEY, and ENDPOINT_URL.
This configuration will work for all methods that allow custom storage options, such as read_csv, read_excel, read_table, etc.
We need to override client_kwargs and set endpoint_url. In this case the address must be the gateway URL. Example: https://gateway.us1.storjshare.io
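The configuration described above can be sketched as a small function that builds the storage_options dictionary from the three environment variables. The key, secret, and client_kwargs entries are the names s3fs accepts; the function name build_storage_options and its env parameter are illustrative choices, not part of any library.

```python
import os


def build_storage_options(env=None):
    """Build the dict that pandas forwards to s3fs.

    `env` defaults to os.environ; passing a plain dict makes testing easier.
    """
    env = os.environ if env is None else env
    return {
        "key": env["ACCESS_KEY_ID"],
        "secret": env["SECRET_ACCESS_KEY"],
        # Point the S3 client at the Storj gateway instead of AWS.
        "client_kwargs": {"endpoint_url": env["ENDPOINT_URL"]},
    }
```

A missing variable raises KeyError right away, which is usually easier to diagnose than an authentication failure later on.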
Saving Data to Storj DCS
In this blog post, we are going to save and load our pandas DataFrame in CSV format. Other formats are supported too, as mentioned in the previous section.
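A minimal sketch of the save step, assuming storage_options is a dictionary with key, secret, and client_kwargs entries as described in the configuration section. The helper name save_frame, the bucket name my-bucket, and the sample columns are hypothetical.

```python
import pandas as pd


def save_frame(df, url, storage_options=None):
    """Write a DataFrame as CSV to a local path or an s3:// URL."""
    # index=False keeps the row index out of the file.
    df.to_csv(url, index=False, storage_options=storage_options)


# Example call against Storj DCS (needs s3fs, network access, and real
# credentials; "my-bucket" is a placeholder):
#
#   frame = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 28]})
#   save_frame(frame, "s3://my-bucket/people.csv", storage_options)
```

Leaving storage_options as None lets the same helper write to a local path, which is handy for trying the code without a bucket.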
Loading Data from Storj DCS
The load process is the same: just pass storage_options as a parameter.
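As a hedged sketch of the load step, the helper below wraps read_csv and forwards storage_options unchanged. The name load_frame and the bucket name my-bucket are hypothetical; storage_options is assumed to be the dictionary with key, secret, and client_kwargs built from the environment variables.

```python
import pandas as pd


def load_frame(url, storage_options=None):
    """Read a CSV from a local path or an s3:// URL into a DataFrame."""
    return pd.read_csv(url, storage_options=storage_options)


# Example call against Storj DCS (needs s3fs, network access, and real
# credentials; "my-bucket" is a placeholder):
#
#   frame = load_frame("s3://my-bucket/people.csv", storage_options)
#   print(frame.head())
```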
Using pandas with Storj DCS is easy and requires only a few lines of configuration.
If you already use pandas with S3, migrating to Storj DCS is straightforward.