How To Load Data From AWS S3 into SageMaker (Using Boto3 or AWSWrangler)

SageMaker provides the compute capacity to build, train, and deploy ML models. To work with your own datasets, you'll typically need to load data from AWS S3 into SageMaker.

You can load data from AWS S3 into AWS SageMaker using the Boto3 library.

In this tutorial, you'll learn how to load data from AWS S3 into a SageMaker Jupyter notebook.

These methods only access the data from S3; the files will not be downloaded to the SageMaker instance itself. If you want to download the file to the SageMaker instance, read How to Download File From S3 Using Boto3 [Python]?
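If you do want a local copy on the instance, Boto3's download_file() call handles it. Here is a minimal sketch using the same bucket and key as the examples below (the local filename IRIS.csv is an arbitrary choice); the linked tutorial covers downloading in detail.

Snippet

import boto3

s3_client = boto3.client('s3')

# Download the object to the instance's local disk
# (the local filename 'IRIS.csv' is an arbitrary choice)
s3_client.download_file('stackvidhya', 'csv_files/IRIS.csv', 'IRIS.csv')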

Prerequisite

  • The SageMaker instance MUST have read access to your S3 buckets. Assign the role AmazonSageMakerServiceCatalogProductsUseRole while creating the SageMaker instance. Refer to this link for more details about SageMaker roles. A quick access check is sketched after this list.
  • Install pandas using pip install pandas to read the CSV file as a dataframe. In most cases, it is available as a default package.
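Before loading any files, you can verify that the notebook's role can read your bucket by listing a few objects. This is a minimal sketch, assuming the same stackvidhya bucket used in the examples below.

Snippet

import boto3

# List up to five objects to confirm the notebook role has read access
s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket='stackvidhya', MaxKeys=5)

for item in response.get('Contents', []):
    print(item['Key'])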

Installing Boto3

If you’ve not installed boto3 yet, you can install it by using the below snippet.

You can use the % symbol before pip to install packages directly from the Jupyter notebook instead of opening a terminal.

Snippet

%pip install boto3

Boto3 will be installed successfully.

Now, you can use it to access AWS resources.
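To confirm the installation, you can import boto3 and print its version number:

Snippet

import boto3

# A quick sanity check that the package imports correctly
print(boto3.__version__)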

Loading CSV file from S3 Bucket Using URI

In this section, you’ll load the CSV file from the S3 bucket using the S3 URI.

There are two options to generate the S3 URI:

  • Copying object URL from the AWS S3 Console.
  • Generate the URI manually by using the String format option. (This is demonstrated in the below example)

Follow the below steps to load the CSV file from the S3 bucket.

  • Import the pandas package to read the CSV file as a dataframe.
  • Create a variable bucket to hold the bucket name.
  • Create the file_key to hold the name of the S3 object. Prefix the subfolder names if your object is under a subfolder of the bucket.
  • Concatenate the bucket name and the object name with the prefix s3:// to generate the URI of the S3 object.
  • Use the generated URI in the read_csv() method of the pandas package and store the result in the dataframe object called df.

In this example, the object is available in the bucket stackvidhya under a subfolder called csv_files. Hence, you'll use stackvidhya as the bucket name and csv_files/IRIS.csv as the file_key.

Snippet

import pandas as pd

bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# Build the S3 URI in the form s3://<bucket>/<key>
s3uri = 's3://{}/{}'.format(bucket, file_key)

# pandas reads the object directly from S3
df = pd.read_csv(s3uri)

df.head()

The CSV file will be read from the S3 location as a pandas dataframe.

You can preview the dataframe using df.head(), which returns the first five rows, as shown below.

Dataframe Will Look Like

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

This is how you can access S3 data in a SageMaker Jupyter notebook without calling any AWS SDK explicitly.

With this method, too, the file is not downloaded to the notebook instance itself.
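Under the hood, pandas hands s3:// URLs to the s3fs library. SageMaker notebook kernels usually include it, but if read_csv() raises an ImportError mentioning s3fs, install it first:

Snippet

%pip install s3fs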

Next, you’ll learn about using external libraries to load the data.

Loading CSV file from S3 Bucket using Boto3

In this section, you’ll use the Boto3.

Boto3 is an AWS SDK for creating, managing, and access AWS services such as S3 and EC2 instances.

Follow the below steps to access the file from S3

  1. Import the pandas package to read the CSV file as a dataframe.
  2. Create a variable bucket to hold the bucket name.
  3. Create the file_key to hold the name of the S3 object. Prefix the subfolder names if your object is under a subfolder of the bucket.
  4. Create an S3 client using boto3.client('s3'). A Boto3 client is a low-level representation of an AWS service.
  5. Get the S3 object using the s3_client.get_object() method. Pass the bucket name and the file key you created in the previous step. It returns the S3 data as a response, stored here as obj.
  6. Read the object body using obj['Body'].read(). It returns bytes. Wrap these bytes in a file-like object using io.BytesIO().
  7. Pass this file-like object to read_csv() in pandas to get a dataframe.

Snippet

import pandas as pd
import boto3
import io

bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# Create a low-level S3 client
s3_client = boto3.client('s3')

# Fetch the object; the payload is in the 'Body' field of the response
obj = s3_client.get_object(Bucket=bucket, Key=file_key)

# Wrap the raw bytes in a file-like object so pandas can parse it
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

df.head()

The dataframe can be printed using the df.head() method. It’ll print the first five rows of the dataframe as shown below.

Dataframe Will Look Like

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

You can also use the same steps to access files from S3 in a Jupyter notebook outside of SageMaker.

Just pass your AWS security credentials while creating the Boto3 client, as shown below. Refer to the tutorial How to create AWS security credentials if you need to create them.

Snippet

s3_client = boto3.client(
    's3',
    aws_access_key_id='AWS_SERVER_PUBLIC_KEY',
    aws_secret_access_key='AWS_SERVER_SECRET_KEY',
    region_name=REGION_NAME,
)
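Hard-coding credentials in a notebook is risky. A safer alternative, sketched below assuming you have configured a named profile with the AWS CLI (the profile name my-profile is hypothetical), is to let Boto3 pick up credentials from a session:

Snippet

import boto3

# 'my-profile' is a hypothetical profile configured via `aws configure --profile my-profile`
session = boto3.Session(profile_name='my-profile')
s3_client = session.client('s3')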

This is how you can read CSV files into SageMaker using Boto3.
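Before moving on, note that Boto3 also offers a higher-level resource API. The following is a minimal sketch of the same read using boto3.resource; this is an alternative to the client-based steps above, not part of them.

Snippet

import io

import boto3
import pandas as pd

# Object-oriented interface to S3
s3 = boto3.resource('s3')
obj = s3.Object('stackvidhya', 'csv_files/IRIS.csv').get()

# The response body is read as bytes, then wrapped for pandas
df = pd.read_csv(io.BytesIO(obj['Body'].read()))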

Next, you’ll learn about the package awswrangler.

Loading CSV File into SageMaker Using AWS Wrangler

In this section, you’ll learn how to access data from AWS s3 using AWS Wrangler.

AWS Wrangler is an AWS professional service open-source python library that extends the functionalities of the pandas library to AWS by connecting dataframe and other data-related services.

This package is not installed by default.

Installing AWSWrangler

Install awswrangler using the pip install command.

At the time of writing this tutorial, this worked only on an AWS SageMaker instance; it didn't work in a normal local Jupyter notebook kernel.

Prefix the pip command with % so the installation runs directly from the Jupyter notebook.

Snippet

%pip install awswrangler

You’ll see the below messages and the AWS Data wrangler will be installed.

Output

    Collecting awswrangler
      Downloading awswrangler-2.8.0-py3-none-any.whl (179 kB)

    Installing collected packages: scramp, redshift-connector, pymysql, pg8000, awswrangler
    Successfully installed awswrangler-2.8.0 pg8000-1.19.5 pymysql-1.0.2 redshift-connector-2.0.881 scramp-1.4.0
    Note: you may need to restart the kernel to use updated packages.

Now, restart the kernel using the Kernel -> Restart option to activate the package.

Once the kernel is restarted, you can use awswrangler to access data from AWS S3 in your SageMaker notebook.

Follow the below steps to access the file from S3 using AWSWrangler.

  1. Import the pandas package to read the CSV file as a dataframe.
  2. Import awswrangler as wr.
  3. Create a variable bucket to hold the bucket name.
  4. Create the file_key to hold the name of the S3 object. Prefix the subfolder names if your object is under a subfolder of the bucket.
  5. Concatenate the bucket name and the file key to generate the s3uri.
  6. Use the read_csv() method in awswrangler to fetch the S3 data using the line wr.s3.read_csv(path=s3uri).

Snippet

import awswrangler as wr
import pandas as pd

bucket = 'stackvidhya'
file_key = 'csv_files/IRIS.csv'

# Build the S3 URI in the form s3://<bucket>/<key>
s3uri = 's3://{}/{}'.format(bucket, file_key)

# awswrangler reads the S3 object into a pandas dataframe
df = wr.s3.read_csv(path=s3uri)

df.head()

The read_csv() method returns a pandas dataframe from the CSV data. You can preview the dataframe using df.head(), which returns the first five rows, as shown below.

Dataframe Will Look Like

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

This is how you can load the CSV file from S3 using awswrangler.
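awswrangler can also write a dataframe back to S3 using wr.s3.to_csv(). A minimal sketch follows; the destination key csv_files/IRIS_copy.csv is a hypothetical name.

Snippet

# Write the dataframe back to S3 as a CSV
# (the destination key below is a hypothetical name)
wr.s3.to_csv(df=df, path='s3://stackvidhya/csv_files/IRIS_copy.csv', index=False)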

Next, you’ll see how to read a normal text file.

Read Text File from S3

You’ve seen how to read the CSV file from AWS S3 in a sagemaker notebook.

In this section, you’ll see how to access a normal text file from `S3 and read its content.

As seen before, you can create an S3 client and get the object using the bucket name and the object key.

Then you can read the object body using the read() method.

The read method will return the file contents as bytes.

You can decode the bytes into a string using contents.decode('utf-8'). UTF-8 is the most widely used character encoding.

Snippet

import boto3

bucket = 'stackvidhya'
data_key = 'text_files/testfile.txt'

s3_client = boto3.client('s3')

# Fetch the object and read the body as bytes
obj = s3_client.get_object(Bucket=bucket, Key=data_key)
contents = obj['Body'].read()

# Decode the bytes into a UTF-8 string
print(contents.decode("utf-8"))

Output

This is a test file to demonstrate the file access functionality from AWS S3 into sagemaker notebook
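For larger text files, you can stream the body line by line instead of reading it all into memory at once. Here is a minimal sketch using the response body's iter_lines() method:

Snippet

import boto3

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='stackvidhya', Key='text_files/testfile.txt')

# Stream the body line by line; each line arrives as bytes
for line in obj['Body'].iter_lines():
    print(line.decode('utf-8'))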

Conclusion

To summarize, you've learned how to access and load data from AWS S3 into a SageMaker Jupyter notebook using the packages boto3 and awswrangler.

You’ve also learned how to access the file without using any additional packages.

If you’ve any questions, feel free to comment below.
