Big data practitioners know that many workflow scheduling applications, such as Apache Airflow and Azkaban, are being incubated or developed in the Apache community to automate various types of data pipeline workflows. At the same time, Apache Oozie remains one of the stable workflow schedulers that data engineers rely on for production workloads in their on-premises big data clusters. In many cases, data teams want to combine cloud capabilities with their own data center, an approach commonly known as hybrid deployment. In this blog, we will learn how to transfer data between an on-premises Hadoop cluster and Amazon S3, run advanced data analytics in Databricks Cloud, and integrate these data pipelines end to end using Apache Oozie.
In this blog, I explain how to trigger an Oozie workflow on a file arrival in HDFS, transfer that file to Amazon S3 using DistCp, perform ETL in Databricks through its REST API and store the output in S3, transfer the aggregated data back to HDFS using DistCp (if required), and send success/failure email notifications.
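To give a flavour of the Databricks step before we get into the Oozie details, here is a minimal Python sketch of how the ETL job might be triggered through the Databricks Jobs REST API. The workspace URL, token, job ID, and parameter names below are placeholders for illustration, not values from this pipeline:

```python
import requests

# Hypothetical values -- replace with your own workspace URL, token, and job ID.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "dapiXXXXXXXXXXXX"   # personal access token
JOB_ID = 123                            # ID of a pre-configured Databricks ETL job

def run_databricks_job(s3_input_path: str, s3_output_path: str) -> int:
    """Trigger a Databricks job via the Jobs run-now endpoint and return its run_id."""
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "job_id": JOB_ID,
            # Example parameters passed through to the job's notebook task.
            "notebook_params": {
                "input_path": s3_input_path,
                "output_path": s3_output_path,
            },
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["run_id"]

if __name__ == "__main__":
    run_id = run_databricks_job("s3://my-bucket/raw/", "s3://my-bucket/aggregated/")
    print(f"Started Databricks run {run_id}")
```

In the actual workflow, a call like this would sit between the DistCp upload to S3 and the optional DistCp copy of the aggregated output back to HDFS; later sections walk through how Oozie orchestrates each of these steps.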