SageMaker provides the compute capacity to build, train, and deploy ML models. You can load data from AWS S3
into SageMaker to create, train, and deploy models there, for example by using the Boto3 library.
In this tutorial, you’ll learn how to load data from AWS S3 into a SageMaker Jupyter notebook.
The methods shown here only read the data from S3; the files are not downloaded to the SageMaker instance itself. If you want to download the file to the SageMaker instance, read How to Download File From S3 Using Boto3 [Python]?
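If you do need a local copy instead of reading in place, a minimal sketch using Boto3's `download_file` looks like the following. The bucket and key names are the ones used throughout this tutorial, and `local_path` / `download_from_s3` are helper names introduced here for illustration:

```python
import os

def local_path(file_key, dest_dir='.'):
    """Map an S3 object key like 'csv_files/IRIS.csv' to a local file path."""
    return os.path.join(dest_dir, os.path.basename(file_key))

def download_from_s3(bucket, file_key, dest_dir='.'):
    """Download the object to the SageMaker instance's local disk."""
    import boto3  # deferred so the path helper above works without boto3 installed
    dest = local_path(file_key, dest_dir)
    boto3.client('s3').download_file(bucket, file_key, dest)
    return dest

# Example (same bucket/key as in this tutorial):
# download_from_s3('stackvidhya', 'csv_files/IRIS.csv')
```

Unlike the read-in-place approaches below, this writes the object to the instance's disk.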
Prerequisites
- The SageMaker instance must have read access to your S3 buckets. Assign a role such as `AmazonSageMakerServiceCatalogProductsUseRole` while creating the SageMaker instance. Refer to this link for more details about SageMaker roles.
- Install pandas using `pip install pandas` to read the CSV file as a dataframe. In most cases, it is available as a default package.
Installing Boto3
If you’ve not installed boto3 yet, you can install it by using the below snippet.
You can use the % symbol before pip to install packages directly from the Jupyter notebook instead of launching the Anaconda Prompt.
Snippet
%pip install boto3
Boto3 will be installed successfully.
Now, you can use it to access AWS resources.
Loading CSV file from S3 Bucket Using URI
In this section, you’ll load the CSV file from the S3 bucket using the S3 URI.
There are two options to generate the S3 URI:
- Copy the object URL from the AWS S3 console.
- Generate the URI manually using string formatting. (This is demonstrated in the example below.)
Follow the below steps to load the CSV file from the S3 bucket.
- Import the `pandas` package to read the CSV file as a dataframe.
- Create a variable `bucket` to hold the bucket name.
- Create `file_key` to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
- Concatenate the bucket name and the object name with the prefix `s3://` to generate the URI of the S3 object.
- Use the generated URI in the `read_csv()` method of the pandas package and store the result in the dataframe object called `df`.
In this example, the object is available in the bucket `stackvidhya`, under a subfolder called `csv_files`. Hence you’ll use the bucket name `stackvidhya` and the file_key `csv_files/IRIS.csv`.
Snippet
import pandas as pd
bucket='stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3uri = 's3://{}/{}'.format(bucket, file_key)
df = pd.read_csv(s3uri)
df.head()
The CSV file will be read from the S3 location as a pandas dataframe.
You can print the dataframe using `df.head()`, which will print the first five rows of the dataframe as shown below.
Dataframe will look like
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
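As a side note, the `'s3://{}/{}'.format(...)` call in the snippet above can equivalently be written with an f-string on Python 3.6+:

```python
bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# f-string equivalent of 's3://{}/{}'.format(bucket, file_key)
s3uri = f's3://{bucket}/{file_key}'
print(s3uri)  # s3://stackvidhya/csv_files/IRIS.csv
```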
This is how you can access S3 data in a SageMaker Jupyter notebook using only pandas. (Under the hood, pandas hands s3:// URIs to the `s3fs` package, which is typically available in SageMaker notebook kernels.)
With this method, the file is also not downloaded into the notebook.
Next, you’ll learn about using external libraries to load the data.
Loading CSV file from S3 Bucket using Boto3
In this section, you’ll use Boto3.
Boto3 is the AWS SDK for Python, used for creating, managing, and accessing AWS services such as S3 and EC2.
Follow the below steps to access the file from S3:
- Import the `pandas` package to read the CSV file as a dataframe.
- Create a variable `bucket` to hold the bucket name.
- Create `file_key` to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
- Create an S3 client using `boto3.client('s3')`. A Boto3 client is a low-level representation of an AWS service.
- Get the S3 object using the `s3_client.get_object()` method. Pass the bucket name and the file key you created in the previous step. It returns the S3 data as a response (stored as `obj`).
- Read the object body using `obj['Body'].read()`. It returns bytes. Wrap those bytes in a file-like object using `io.BytesIO()`.
- Pass this object to `read_csv()` in pandas to get a dataframe.
Snippet
import pandas as pd
import boto3
import io
bucket='stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=file_key)
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
df.head()
The dataframe can be printed using the df.head()
method. It’ll print the first five rows of the dataframe as shown below.
Dataframe Will Look Like
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
You can also use the same steps to access files from S3 in a Jupyter notebook outside of SageMaker.
Just pass your AWS security credentials while creating the boto3 client, as shown below. Refer to the tutorial How to create AWS security credentials to create credentials.
Snippet
s3_client = boto3.client(
    's3',
    aws_access_key_id='AWS_SERVER_PUBLIC_KEY',
    aws_secret_access_key='AWS_SERVER_SECRET_KEY',
    region_name=REGION_NAME,
)
This is how you can read CSV files into SageMaker using Boto3.
Next, you’ll learn about the package `awswrangler`.
Loading CSV File into Sagemaker using AWS Wrangler
In this section, you’ll learn how to access data from AWS S3 using AWS Wrangler.
AWS Wrangler is an open-source Python library from AWS Professional Services that extends pandas to AWS by connecting dataframes with AWS data-related services.
This package is not installed by default.
Installing AWSWrangler
Install `awswrangler` using the pip install command.
At the time of writing this tutorial, this step was verified on an AWS SageMaker instance; it may not work the same way in a regular local Jupyter notebook kernel.
Prefix the `pip` command with `%` so the installation runs directly from the Jupyter notebook.
Snippet
%pip install awswrangler
You’ll see the below messages and the AWS Data wrangler will be installed.
Output
Collecting awswrangler
Downloading awswrangler-2.8.0-py3-none-any.whl (179 kB)
Installing collected packages: scramp, redshift-connector, pymysql, pg8000, awswrangler
Successfully installed awswrangler-2.8.0 pg8000-1.19.5 pymysql-1.0.2 redshift-connector-2.0.881 scramp-1.4.0
Note: you may need to restart the kernel to use updated packages.
Now, restart the kernel using the Kernel -> Restart option to activate the package.
Once the kernel is restarted, you can use `awswrangler` to access data from AWS S3 in your SageMaker notebook.
Follow the below steps to access the file from S3 using AWSWrangler.
- Import the `pandas` package to read the CSV file as a dataframe.
- Import `awswrangler` as `wr`.
- Create a variable `bucket` to hold the bucket name.
- Create `file_key` to hold the name of the S3 object. You can prefix the subfolder names if your object is under any subfolder of the bucket.
- Concatenate the bucket name and the file key to generate the `s3uri`.
- Use the `read_csv()` method in `awswrangler` to fetch the S3 data using the line `wr.s3.read_csv(path=s3uri)`.
Snippet
import awswrangler as wr
import pandas as pd
bucket='stackvidhya'
file_key = 'csv_files/IRIS.csv'
s3uri = 's3://{}/{}'.format(bucket, file_key)
df = wr.s3.read_csv(path=s3uri)
df.head()
The `read_csv()` method will return a pandas dataframe from the CSV data. You can print the dataframe using `df.head()`, which will return the first five rows of the dataframe as shown below.
Dataframe Will Look Like
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
This is how you can load the CSV file from S3 using `awswrangler`.
Next, you’ll see how to read a normal text file.
Read Text File from S3
You’ve seen how to read a CSV file from AWS S3 in a SageMaker notebook.
In this section, you’ll see how to access a normal text file from S3 and read its content.
As before, create an S3 client and get the object using the bucket name and the object key.
Then read the object body using the `read()` method.
The read method returns the file contents as bytes.
You can decode the bytes into a string using `contents.decode('utf-8')`. UTF-8 is the most widely used character encoding.
Snippet
import boto3
bucket='stackvidhya'
data_key = 'text_files/testfile.txt'
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=data_key)
contents = obj['Body'].read()
print(contents.decode("utf-8"))
Output
This is a test file to demonstrate the file access functionlity from AWS S3 into sagemaker notebook
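For large text files, reading the whole body at once with `read()` may be wasteful. As a sketch, the response body is a streaming object that can be consumed line by line via its `iter_lines()` method; `decode_lines` below is a helper introduced here for illustration:

```python
import io

def decode_lines(byte_lines, encoding='utf-8'):
    """Decode an iterable of byte lines into a list of stripped text lines."""
    return [raw.decode(encoding).rstrip('\r\n') for raw in byte_lines]

# With boto3 (same bucket/key as above), streaming instead of read():
# obj = s3_client.get_object(Bucket='stackvidhya', Key='text_files/testfile.txt')
# lines = decode_lines(obj['Body'].iter_lines())

# Works with any iterable of byte lines, e.g. a quick local check:
sample = io.BytesIO(b'line one\nline two\n')
print(decode_lines(sample))  # ['line one', 'line two']
```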
Conclusion
To summarize, you’ve learned how to load data from AWS S3 into a SageMaker Jupyter notebook using the packages `boto3` and `awswrangler`.
You’ve also learned how to access the file without importing any additional packages.
If you have any questions, feel free to comment below.