An AWS Glue workflow is a data processing pipeline composed of crawlers, jobs, and triggers; the workflow in this blog converts uploaded data files into Apache Parquet format. We will use an AWS Glue event-driven workflow to demonstrate the execution of the entire flow: using the command line, we will copy files created locally to an S3 bucket, which will trigger the AWS Glue workflow and, on completion, convert the files into Parquet format.
Refer to Part 1 of the blog here.
Refer to Part 2 of the blog here.
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
What is Amazon EventBridge?
Amazon EventBridge is a serverless event bus service that connects applications using events generated by AWS services, SaaS applications, and your own custom applications.
Hands-On
In this hands-on, we will deploy the provided AWS CloudFormation stack to create an event-driven workflow that converts the files in an S3 bucket into Parquet format using AWS Glue, and we will monitor the logs via Amazon EventBridge. With this event flow, a data integration workflow can be triggered by events from AWS services, software-as-a-service (SaaS) providers, and custom applications. If your environment generates many events, AWS Glue allows you to batch them either by number of events or by time duration; event-driven workflows make it easy to start an AWS Glue workflow based on real-time events. The CloudFormation template will provision an AWS Glue workflow that includes a crawler, jobs, and triggers, with the first trigger in the workflow configured as an event-based trigger. The workflow listens to S3 PutObject data events captured by AWS CloudTrail and is configured to run when five new files are added, or when the 900-second batching window expires after the first file is added. The CloudFormation template generates the following:
S3 bucket: Stores data, CloudTrail logs, job scripts, and any temporary files generated during the AWS Glue ETL job run.
AWS Glue workflow: A data processing pipeline composed of a crawler, jobs, and triggers. This workflow converts uploaded data files into Apache Parquet format.
AWS Glue database: The AWS Glue Data Catalog database will be used to hold the tables created in this hands-on.
AWS Glue table: The Data Catalog table representing the Parquet files being converted by the workflow.
AWS Lambda function: Will be used as an AWS CloudFormation custom resource to copy job scripts from an AWS Glue-managed GitHub repository and an AWS Big Data blog S3 bucket to your S3 bucket.
IAM roles and policies: We will be using the following AWS Identity and Access Management (IAM) roles.
LambdaExecutionRole: Runs the Lambda function that will have permission to upload the job scripts to the S3 bucket.
GlueServiceRole: Runs the AWS Glue job that will have permission to download the script, read data from the source, and write data to the destination after conversion.
EventBridgeGlueExecutionRole: Permissions to invoke the NotifyEvent API for an AWS Glue workflow.
To implement this, we will do the following:
- Log in to your AWS console and navigate to the dashboard.
- Using the provided link, navigate to the CloudFormation template configuration dashboard.
- Open the template in Designer mode and explore the services it will deploy.
- Follow the configuration steps to create the stack and deploy the required resources.
- Navigate to the dashboards of the deployed services and policies to review the created resources.
- Create an AWS Glue workflow with a starting trigger of EVENT type and configure the batch size on the trigger to be five and the batch window to be 900 seconds.
- Configure Amazon S3 to log data events.
- Create a rule in EventBridge.
- Add an AWS Glue event-driven workflow as a target to the EventBridge rule.
- Start the workflow, and upload files to the S3 bucket.
- The workflow is triggered only once at least five files have arrived (or the batch window expires), so make sure to upload five files to the bucket.
- Verify the converted files in the S3 bucket once the workflow execution is completed.
- Navigate to the EventBridge dashboard rules and explore the CloudWatch logs.
- Terminate the stack if you are following the hands-on for learning purposes.
Log in to your AWS console and navigate to the dashboard.
Using the link below, navigate to the deployment page of the AWS CloudFormation template.
You will be navigated to the screen shown in the image below, with all the details filled in. Click on View in Designer to see which services will be deployed using the template.
Here, you can see all the services that will be deployed using the provided template.
At the bottom, choose Components; under the Parameters tab, you will see the details of all the services being deployed.
Navigate back to the configuration dashboard. Scroll down and click on Next.
In the next step, either change the name of the stack or keep the default.
Scroll down and enter a name for the S3 bucket that is to be created. Leave the other settings at their defaults. Click on Next.
Add tags if any are needed for your CloudFormation template. Under Permissions, attach an IAM role if needed for your template.
For stack failure options choose the option shown in the image below.
Under Advanced options, for stack policy choose the option as selected in the image below.
You can configure the Rollback configuration as well.
Under the Notification settings, you can select an SNS topic if needed for receiving notifications.
Under the Stack creation options, you can set a timeout after which stack creation fails and is rolled back. Click on Next.
Review all the settings for the CloudFormation template.
Scroll down to the bottom of the page and check the acknowledgment box and click on Create stack.
Once done, you will see the status as CREATE_IN_PROGRESS.
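If you prefer the command line, you can also watch the stack creation progress with the AWS CLI. A minimal sketch, assuming a hypothetical stack name of glue-event-driven-workflow (substitute the name you chose):

# Check the current status of the stack
aws cloudformation describe-stacks --stack-name glue-event-driven-workflow --query "Stacks[0].StackStatus"
# Block until creation finishes (returns once the status is CREATE_COMPLETE)
aws cloudformation wait stack-create-complete --stack-name glue-event-driven-workflow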
Select the Resources tab and you will see all the resources being created.
Under the Parameters tab, you will see the parameters configured in the stack.
Under the Template tab, you will find the entire template for your configuration.
Finally, under the Stack info tab, you can view the details related to the stack and the creation status for the same.
Now, from the Resources tab, follow the link for each deployed service to review its creation status.
Under IAM, you will find the policies attached to the newly created roles.
Navigating to the S3 dashboard, you will see the newly created bucket, with folders created for the job scripts and CloudTrail logs.
Navigating to the EventBridge dashboard, you will find the event pattern created.
Scroll down and you will see the Glue ARN attached to the EventBridge Rule.
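For reference, a rule with an equivalent event pattern and target could also be created from the AWS CLI. This is only an illustrative sketch, not the stack's exact definition; the rule name, bucket name, region, account ID, and workflow name are placeholders to replace with your stack's values:

# Match S3 PutObject data events recorded by CloudTrail for the bucket
aws events put-rule --name s3-file-upload-rule --event-pattern '{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject"],
    "requestParameters": {"bucketName": ["<bucket-name>"]}
  }
}'
# Attach the Glue workflow as the rule's target, using EventBridgeGlueExecutionRole
aws events put-targets --rule s3-file-upload-rule --targets \
  'Id=1,Arn=arn:aws:glue:<region>:<account-id>:workflow/<workflow-name>,RoleArn=arn:aws:iam::<account-id>:role/EventBridgeGlueExecutionRole'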
When you navigate to the Lambda dashboard, you will find the Lambda function used by the template.
Now, search for the AWS Glue service. Click on it and navigate to the dashboard.
You will be navigated to the AWS Glue dashboard.
Click on Triggers in the left navigation pane.
Search for the trigger with the name <Workflow_name>_pre_job_trigger. Click on Edit.
You will be navigated to the configuration dashboard.
Enter 5 for the number of events and 900 for the time delay in seconds. Click on Next.
On the Next page, leave the configuration as it is by default.
Scroll down to the bottom of the page and click on Next.
On the next page, click on Finish.
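The same batching condition can also be expressed through the AWS CLI when defining an event trigger. A minimal sketch, with hypothetical workflow, trigger, and job names (your stack generates its own):

# Define an EVENT trigger that fires after 5 events or a 900-second batch window
aws glue create-trigger \
  --name <workflow-name>_pre_job_trigger \
  --type EVENT \
  --workflow-name <workflow-name> \
  --event-batching-condition BatchSize=5,BatchWindow=900 \
  --actions JobName=<job-name>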
Now, create a new folder on your local machine. Open the command terminal and execute the command below to create a JSON file for one product. Then run the same command, with different IDs, for the next four to five products (or use the loop shown after the command).
echo '{"product_id": "00001", "product_name": "Television", "created_at": "2021-06-01"}' > product_00001.json
Execute the below command:
aws configure
Log in as the IAM user by providing the Access Key ID and Secret Access Key. Then execute the command below to move the JSON files from your local folder to the newly created S3 bucket.
aws s3 cp product_00001.json s3://<bucket-name>/data/products_raw/
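To upload all five files at once, you can loop over them; replace <bucket-name> with the bucket created by the stack:

# Copy every product JSON file to the raw-data prefix watched by the workflow
for f in product_*.json; do
  aws s3 cp "$f" "s3://<bucket-name>/data/products_raw/"
done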
Navigate back to the S3 bucket and you will see the files in the S3 bucket.
Navigate back to the AWS Glue dashboard and select Workflows from the left navigation pane. You will see the flow in the Running state.
Select the workflow and choose the History tab. Click on View run details.
If you scroll down, under the Graph, you will see the execution status of the workflow. Once everything is green and checked, navigate back to the S3 bucket.
You can see the converted Parquet files in the newly created bucket.
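You can also verify the converted output from the CLI. The output prefix below is an assumption; check the prefix your Glue job actually writes to:

# List the converted Parquet files (adjust the prefix to match your stack)
aws s3 ls "s3://<bucket-name>/data/products/" --recursive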
To view logs, navigate to the Amazon EventBridge dashboard and open the Rules tab. Select and open the newly created rule.
Click on Metrics for the rule to navigate to the CloudWatch metrics.
Select both metrics; on the graph, you will see the Invocations and TriggeredRules counts.
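The same counts can be pulled with the AWS CLI. A sketch assuming the hypothetical rule name s3-file-upload-rule; adjust the time window so it covers your workflow run (both metrics live in the AWS/Events namespace):

# Sum how many times the rule matched during the chosen window
aws cloudwatch get-metric-statistics \
  --namespace AWS/Events \
  --metric-name TriggeredRules \
  --dimensions Name=RuleName,Value=s3-file-upload-rule \
  --start-time 2021-06-01T00:00:00Z \
  --end-time 2021-06-01T01:00:00Z \
  --period 3600 \
  --statistics Sum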
If you are following this hands-on for learning purposes, make sure to navigate back to the CloudFormation service and delete the stack.
In the modal, click on Delete stack.
The deletion process will be initiated, and all the deployed resources will be deleted.
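The teardown can also be done from the CLI. Note that CloudFormation cannot delete a non-empty S3 bucket, so empty it first (same assumed stack name as above):

# Empty the bucket so CloudFormation can delete it
aws s3 rm "s3://<bucket-name>" --recursive
# Delete the stack and everything it created
aws cloudformation delete-stack --stack-name glue-event-driven-workflow
# Optionally wait until the deletion completes
aws cloudformation wait stack-delete-complete --stack-name glue-event-driven-workflow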
Conclusion
In this blog, we saw how to create an event-driven workflow that converts the files in an S3 bucket into Parquet format using AWS Glue, and how to monitor the logs via Amazon EventBridge. We set up an AWS Glue workflow that listens to S3 PutObject data events captured by AWS CloudTrail, created a new AWS Glue trigger of type EVENT, and placed it as the first trigger in the workflow. We also looked at event batching: without it, the AWS Glue workflow would be triggered every time an EventBridge rule matches, which may result in multiple concurrent workflow runs. We will discuss more use cases of AWS Glue and Amazon EventBridge in our upcoming blogs. Stay tuned to keep getting all updates about our upcoming new blogs on AWS and relevant technologies.
Meanwhile …
Keep Exploring -> Keep Learning -> Keep Mastering
This blog is part of our effort towards building a knowledgeable and kick-ass tech community. At Workfall, we strive to provide the best tech and pay opportunities to AWS-certified talents. If you’re looking to work with global clients, build kick-ass products while making big bucks doing so, give it a shot at workfall.com/partner today.