Amazon SageMaker Feature Store provides an end-to-end solution to automate feature engineering for machine learning (ML). For many ML use cases, raw data like log files, sensor readings, or transaction records need to be transformed into meaningful features that are optimized for model training.
Feature quality is critical to ensure a highly accurate ML model. Transforming raw data into features using aggregation, encoding, normalization, and other operations is often needed and can require significant effort. Engineers must manually write custom data preprocessing and aggregation logic in Python or Spark for each use case.
This undifferentiated heavy lifting is cumbersome, repetitive, and error-prone. The SageMaker Feature Store Feature Processor reduces this burden by automatically transforming raw data into aggregated features suitable for batch training ML models. It lets engineers provide simple data transformation functions, then handles running them at scale on Spark and managing the underlying infrastructure. This enables data scientists and data engineers to focus on the feature engineering logic rather than implementation details.
In this post, we demonstrate how a car sales company can use the Feature Processor to transform raw sales transaction data into features in three steps:
Local runs of data transformations.
Remote runs at scale using Spark.
Operationalization via pipelines.
We show how SageMaker Feature Store ingests the raw data, runs feature transformations remotely using Spark, and loads the resulting aggregated features into a feature group. These engineered features can then be used to train ML models.
For this use case, we see how SageMaker Feature Store helps convert the raw car sales data into structured features. These features are subsequently used to gain insights like:
Average and maximum price of red convertibles from 2010
Models with best mileage vs. price
Sales trends of new vs. used cars over the years
Differences in average MSRP across locations
We also see how SageMaker Feature Store pipelines keep the features updated as new data comes in, enabling the company to continually gain insights over time.
We work with the dataset car_data.csv, which contains specifications such as model, year, status, mileage, price, and MSRP for used and new cars sold by the company. The following screenshot shows an example of the dataset.
The solution notebook feature_processor.ipynb contains the following main steps, which we explain in this post:
Create two feature groups: one called car-data for raw car sales records and another called car-data-aggregated for aggregated car sales records.
Use the @feature_processor decorator to load data into the car-data feature group from Amazon Simple Storage Service (Amazon S3).
Run the @feature_processor code remotely as a Spark application to aggregate the data.
Operationalize the feature processor via SageMaker pipelines and schedule runs.
Explore the feature processing pipelines and lineage in Amazon SageMaker Studio.
Use aggregated features to train an ML model.
To follow this tutorial, you need the following:
For this post, we refer to the following notebook, which demonstrates how to get started with Feature Processor using the SageMaker Python SDK.
Create feature groups
To create the feature groups, complete the following steps:
Create a feature group definition for car-data as follows:
The features correspond to each column in the car_data.csv dataset (Model, Year, Status, Mileage, Price, and MSRP).
Add the record identifier id and event time ingest_time to the feature group:
Create a feature group definition for car-data-aggregated as follows:
For the aggregated feature group, the features are model year status, average mileage, max mileage, average price, max price, average MSRP, max MSRP, and ingest time. We add the record identifier model_year_status and event time ingest_time to this feature group.
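The two sets of feature definitions described above can be sketched as plain Python data. This is a minimal illustration, not the notebook's code: the entries use the FeatureName/FeatureType shape of the CreateFeatureGroup API, while the helper names (feature, validate) and the concrete types chosen for each column are assumptions made for this example.

```python
# Sketch only: feature definitions as FeatureName/FeatureType entries.
# The helper names and the per-column types are assumptions for illustration.

def feature(name, feature_type):
    """Return one feature definition entry."""
    return {"FeatureName": name, "FeatureType": feature_type}

# Raw sales records: one feature per dataset column, plus the record
# identifier (id) and the event time (ingest_time).
CAR_DATA_FEATURES = [
    feature("id", "String"),
    feature("model", "String"),
    feature("year", "String"),
    feature("status", "String"),
    feature("mileage", "String"),   # assumed String: raw values may contain punctuation
    feature("price", "String"),
    feature("msrp", "String"),
    feature("ingest_time", "Fractional"),
]

# Aggregated records: keyed by model_year_status, with ingest_time as event time.
CAR_DATA_AGGREGATED_FEATURES = [
    feature("model_year_status", "String"),
    feature("avg_mileage", "Fractional"),
    feature("max_mileage", "Fractional"),
    feature("avg_price", "Fractional"),
    feature("max_price", "Fractional"),
    feature("avg_msrp", "Fractional"),
    feature("max_msrp", "Fractional"),
    feature("ingest_time", "Fractional"),
]

def validate(definitions, record_identifier, event_time):
    """Check names are unique and the identifier and event time are present."""
    names = [d["FeatureName"] for d in definitions]
    if len(names) != len(set(names)):
        raise ValueError("duplicate feature names")
    for required in (record_identifier, event_time):
        if required not in names:
            raise ValueError(f"missing required feature: {required}")
    return names

raw_names = validate(CAR_DATA_FEATURES, "id", "ingest_time")
aggregated_names = validate(
    CAR_DATA_AGGREGATED_FEATURES, "model_year_status", "ingest_time"
)
```

Validating the definitions locally before creating the feature groups catches a missing identifier or event time feature early.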
Now, create the car-data feature group:
Create the car-data-aggregated feature group:
You can navigate to the SageMaker Feature Store option under Data on the SageMaker Studio Home menu to see the feature groups.
Use the @feature_processor decorator to load data
In this section, we locally transform the raw input data (car_data.csv) from Amazon S3 into the car-data feature group using the Feature Store Feature Processor. This initial local run allows us to develop and iterate before running remotely, and could be done on a sample of the data if desired for faster iteration.
With the @feature_processor decorator, your transformation function runs in a Spark runtime environment where the input arguments provided to your function and its return value are Spark DataFrames.
The number of input parameters in your transformation function must match the number of inputs configured in the @feature_processor decorator. In this case, the @feature_processor decorator has car_data.csv as input and the car-data feature group as output, indicating this is a batch operation with the target_store as OfflineStore:
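The rule that the parameter count must match the number of configured inputs can be illustrated with a toy decorator. This is a hypothetical stand-in written for this explanation, not the SageMaker SDK's @feature_processor: the name checked_processor, its inputs and output arguments, and the behavior of the wrapper are all assumptions.

```python
import functools
import inspect

def checked_processor(inputs, output):
    """Hypothetical stand-in for a processor decorator; NOT the SageMaker SDK."""
    def decorator(func):
        # Compare the function's parameter count with the configured inputs.
        n_params = len(inspect.signature(func).parameters)
        if n_params != len(inputs):
            raise TypeError(
                f"{func.__name__} takes {n_params} parameter(s) but "
                f"{len(inputs)} input(s) are configured"
            )

        @functools.wraps(func)
        def wrapper(*loaded_inputs):
            # A real runtime would load each input as a DataFrame and write the
            # result to `output`; here the caller passes the data in directly.
            return func(*loaded_inputs)

        wrapper.inputs = list(inputs)
        wrapper.output = output
        return wrapper
    return decorator

# One configured input, so the function takes exactly one parameter.
@checked_processor(inputs=["car_data.csv"], output="car-data")
def transform(raw_rows):
    return [dict(row) for row in raw_rows]
```

In this sketch, a function with two parameters and one configured input is rejected at decoration time rather than when the job runs.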
Define the transform() function to transform the data. This function performs the following actions:
Convert column names to lowercase.
Add the event time to the ingest_time column.
Remove punctuation and replace missing values with NA.
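The three cleaning steps above can be sketched in plain Python over a list of dictionaries. The notebook's transform() operates on Spark DataFrames, so this is only an illustration of the logic: the function name transform_rows, the sample rows, and the choice to strip every ASCII punctuation character are assumptions.

```python
import string
import time

# Translation table that deletes every ASCII punctuation character.
# Note that this also removes decimal points and hyphens.
_PUNCTUATION = str.maketrans("", "", string.punctuation)

def transform_rows(rows, now=None):
    """Lowercase column names, add ingest_time, strip punctuation, fill NA."""
    ingest_time = time.time() if now is None else now
    cleaned = []
    for row in rows:
        out = {}
        for column, value in row.items():
            text = "" if value is None else str(value)
            text = text.translate(_PUNCTUATION).strip()
            # Missing or empty values become the string "NA".
            out[column.lower()] = text if text else "NA"
        out["ingest_time"] = ingest_time
        cleaned.append(out)
    return cleaned

# Made-up sample rows, for illustration only.
sample_rows = [
    {"Model": "Acura TLX", "Year": "2022", "Status": "New",
     "Mileage": None, "Price": "$49,445", "MSRP": "$49,445"},
    {"Model": "Honda Civic", "Year": "2019", "Status": "Used",
     "Mileage": "31,200 mi.", "Price": "$18,900", "MSRP": ""},
]

cleaned = transform_rows(sample_rows, now=1700000000.0)
```

Passing a fixed now value keeps the output reproducible while iterating locally.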
Call the transform() function to store the data in the car-data feature group:
The output shows that the data is ingested successfully into the car-data feature group.
The output of the transform_df.show() function is as follows:
We have successfully transformed the input data and ingested it in the car-data feature group.
Run the @feature_processor code remotely
In this section, we demonstrate running the feature processing code remotely as a Spark application using the @remote decorator. Running remotely on Spark lets the processing scale to large datasets, because Spark distributes the work across a cluster to handle data that is too big for a single machine. The @remote decorator runs the local Python code as a single-node or multi-node SageMaker training job.
Use the @remote decorator along with the @feature_processor decorator as follows:
The spark_config parameter indicates this is run as a Spark application. The SparkConfig instance configures the Spark configuration and dependencies.
Define the aggregate() function to aggregate the data using PySpark SQL and user-defined functions (UDFs). This function performs the following actions:
Concatenate model, year, and status to create model_year_status.
Take the average of price to create avg_price.
Take the max value of price to create max_price.
Take the average of mileage to create avg_mileage.
Take the max value of mileage to create max_mileage.
Take the average of msrp to create avg_msrp.
Take the max value of msrp to create max_msrp.
Group by model_year_status.
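The aggregation steps above amount to a group-by with average and maximum per group. The following plain-Python sketch shows that logic on a few made-up rows; the notebook's aggregate() uses PySpark instead. The function name aggregate_rows, the underscore separator in model_year_status, and the assumption that price, mileage, and msrp are already numeric are all choices made for this example.

```python
from collections import defaultdict

def aggregate_rows(rows):
    """Group rows by model_year_status and compute avg/max per numeric column."""
    groups = defaultdict(list)
    for row in rows:
        # The "_" separator is an assumption for this sketch.
        key = f"{row['model']}_{row['year']}_{row['status']}"
        groups[key].append(row)

    result = []
    for key, members in sorted(groups.items()):
        record = {"model_year_status": key}
        for column in ("price", "mileage", "msrp"):
            values = [float(member[column]) for member in members]
            record[f"avg_{column}"] = sum(values) / len(values)
            record[f"max_{column}"] = max(values)
        result.append(record)
    return result

# Made-up sample rows, for illustration only.
sample = [
    {"model": "Civic", "year": "2020", "status": "Used",
     "price": 20000, "mileage": 30000, "msrp": 25000},
    {"model": "Civic", "year": "2020", "status": "Used",
     "price": 22000, "mileage": 10000, "msrp": 25000},
    {"model": "Civic", "year": "2023", "status": "New",
     "price": 27000, "mileage": 5, "msrp": 27500},
]

aggregated = aggregate_rows(sample)
```

This sketch does not handle "NA" placeholders or add the ingest_time column; a real job would need to filter or cast those values before aggregating.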
Run the aggregate() function, which creates a SageMaker training job to run the Spark application:
As a result, SageMaker creates a training job to run the Spark application defined earlier. It creates a Spark runtime environment using the sagemaker-spark-processing image.
We use SageMaker Training jobs here to run our Spark feature processing application. With SageMaker Training, you can reduce startup times to 1 minute or less by using warm pooling, which is unavailable in SageMaker Processing. This makes SageMaker Training better optimized for short batch jobs like feature processing where startup time is important.
To view the details, on the SageMaker console, choose Training jobs under Training in the navigation pane, then choose the job with the name aggregate-<timestamp>.
The output of the aggregate() function includes telemetry. Within the output, you will see the aggregated data as follows:
When the training job is complete, you should see the following output:
Operationalize the feature processor via SageMaker pipelines
In this section, we demonstrate how to operationalize the feature processor by promoting it to a SageMaker pipeline and scheduling runs.
First, upload the transformation_code.py file containing the feature processing logic to Amazon S3:
Next, create a Feature Processor pipeline car_data_pipeline using the .to_pipeline() function:
To run the pipeline, use the following code:
Similarly, you can create a pipeline for aggregated features called car_data_aggregated_pipeline and start a run.
Schedule the car_data_aggregated_pipeline to run every 24 hours:
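A 24-hour interval is commonly written as an EventBridge-style rate expression. The helper below is a small sketch of building such a string; the function name rate_expression is hypothetical, and whether the notebook passes exactly this expression to the scheduler is an assumption.

```python
def rate_expression(value, unit):
    """Build an EventBridge-style rate expression such as 'rate(24 hours)'."""
    allowed = ("minute", "hour", "day")
    if unit not in allowed:
        raise ValueError(f"unit must be one of {allowed}")
    if not isinstance(value, int) or value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"

DAILY = rate_expression(24, "hour")
```

Building the string through a helper avoids the singular/plural mistake that rate expressions are sensitive to.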
In the output section, you will see the ARN of the pipeline, the pipeline execution role, and the schedule details:
To get all the Feature Processor pipelines in this account, use the list_pipelines() function on the Feature Processor:
The output will be as follows:
We have successfully created SageMaker Feature Processor pipelines.
Explore feature processing pipelines and ML lineage
In SageMaker Studio, complete the following steps:
On the SageMaker Studio console, on the Home menu, choose Pipelines.
You should see two pipelines created: car-data-ingestion-pipeline and car-data-aggregated-ingestion-pipeline.
Choose the car-data-ingestion-pipeline.
It shows the run details on the Executions tab.
To view the feature group populated by the pipeline, choose Feature Store under Data.
You will see the two feature groups we created in the previous steps.
Choose the car-data feature group.
You will see the feature details on the Features tab.
View pipeline runs
To view the pipeline runs, complete the following steps:
On the Pipeline Executions tab, select car-data-ingestion-pipeline.
This will show all the runs.
Choose one of the links to see the details of the run.
To view lineage, choose Lineage.
The full lineage for car-data shows the input data source car_data.csv and upstream entities. The lineage for car-data-aggregated shows the input car-data feature group.
Choose Load features and then choose Query upstream lineage on car-data and car-data-ingestion-pipeline to see all the upstream entities.
The full lineage for car-data feature group should look like the following screenshot.
Similarly, the lineage for the car-data-aggregated feature group should look like the following screenshot.
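The upstream relationships described in this section can be pictured as a small directed graph. The sketch below is an illustration only: the UPSTREAM mapping is a simplified assumption built from the names in this post, and the lineage that SageMaker records contains more entity types than are shown here.

```python
# Simplified, assumed lineage: each entity maps to its direct upstream entities.
UPSTREAM = {
    "car-data-aggregated": ["car-data-aggregated-ingestion-pipeline"],
    "car-data-aggregated-ingestion-pipeline": ["car-data"],
    "car-data": ["car-data-ingestion-pipeline"],
    "car-data-ingestion-pipeline": ["car_data.csv"],
    "car_data.csv": [],
}

def upstream_entities(entity, graph=UPSTREAM):
    """Return every entity reachable upstream of `entity`, in visit order."""
    seen = []
    stack = list(graph.get(entity, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(graph.get(node, []))
    return seen

car_data_lineage = upstream_entities("car-data")
aggregated_lineage = upstream_entities("car-data-aggregated")
```

Querying upstream of car-data-aggregated walks back through the car-data feature group to the original car_data.csv source, which mirrors what the upstream lineage query shows in Studio.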
SageMaker Studio provides a single environment to track scheduled pipelines, view runs, explore lineage, and view the feature processing code.
The aggregated features such as average price, max price, average mileage, and more in the car-data-aggregated feature group provide insight into the nature of the data. You can also use these features as a dataset to train a model to predict car prices, or for other operations. However, training the model is out of scope for this post, which focuses on demonstrating the SageMaker Feature Store capabilities for feature engineering.
Don’t forget to clean up the resources created as part of this post to avoid incurring ongoing charges.
Disable the scheduled pipeline by calling the fp.schedule() method with the state parameter set to Disabled:
Delete both feature groups:
The data residing in the S3 bucket and offline feature store can incur costs, so you should delete them to avoid any charges.
In this post, we demonstrated how a car sales company used SageMaker Feature Store Feature Processor to gain valuable insights from their raw sales data by:
Ingesting and transforming batch data at scale using Spark
Operationalizing feature engineering workflows via SageMaker pipelines
Providing lineage tracking and a single environment to monitor pipelines and explore features
Preparing aggregated features optimized for training ML models
By following these steps, the company was able to transform previously unusable data into structured features that could then be used to train a model to predict car prices. SageMaker Feature Store enabled them to focus on feature engineering rather than the underlying infrastructure.
We hope this post helps you unlock valuable ML insights from your own data using SageMaker Feature Store Feature Processor!
For more information, refer to Feature Processing and the SageMaker example on Amazon SageMaker Feature Store: Feature Processor Introduction.
About the Authors
Dhaval Shah is a Senior Solutions Architect at AWS, specializing in Machine Learning. With a strong focus on digital native businesses, he empowers customers to leverage AWS and drive their business growth. As an ML enthusiast, Dhaval is driven by his passion for creating impactful solutions that bring positive change. In his leisure time, he indulges in his love for travel and cherishes quality moments with his family.
Ninad Joshi is a Senior Solutions Architect at AWS, helping global AWS customers design secure, scalable, and cost effective solutions in cloud to solve their complex real-world business challenges. His work in Machine Learning (ML) covers a wide range of AI/ML use cases, with a primary focus on End-to-End ML, Natural Language Processing, and Computer Vision. Prior to joining AWS, Ninad worked as a software developer for 12+ years. Outside of his professional endeavors, Ninad enjoys playing chess and exploring different gambits.