TL;DR: Suppose your on-premises cluster receives an audit file in CSV format on a daily basis through an internal or external process, and assume this audit file contains IT environment and resource usage details. You would like to build an advanced analytics solution in Spark, but you are limited to an older Apache Spark version (say 1.5 or 1.6) on your on-premises cluster. You therefore decide to leverage the latest Apache Spark version (2.2+) in Databricks for your ETL and other analytics needs, such as finding the total hours of power each computer consumed per day, along with a visualization, and finally you want to convert that aggregated output to Parquet and store it in S3. As an optional step, you also want to transfer the Parquet output back to your HDFS cluster. For all success/failure actions, you would like to receive email notifications. This is the use case.
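To make the aggregation step concrete, here is a minimal PySpark sketch of what the Databricks ETL could look like. The bucket paths and column names (computer, date, usage_hours) are hypothetical placeholders, since the actual audit file layout is not shown here; this is only a sketch under those assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("audit-etl").getOrCreate()

# Read the daily audit CSV that was copied to S3 (path and schema are assumed)
audit = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("s3a://my-bucket/landing/audit_*.csv"))

# Total power-on hours per computer per day (column names are hypothetical)
daily_usage = (audit
               .groupBy("computer", "date")
               .agg(F.sum("usage_hours").alias("total_hours")))

# Persist the aggregated output as Parquet back to S3
daily_usage.write.mode("overwrite").parquet("s3a://my-bucket/aggregated/daily_usage")
```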
In short, we will trigger an Oozie workflow when a file arrives in HDFS, transfer the file to Amazon S3 using DistCp, perform the ETL in Databricks through a REST API and store the output in S3, transfer the aggregated data back to HDFS using DistCp (if required), and trigger success/failure email notifications.
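One way to invoke Databricks from the workflow is the Jobs API runs/submit endpoint, which starts a one-time run on a new job cluster. The sketch below uses Python with the requests library; the workspace URL, token, runtime version, node type, and notebook path are all placeholders, and the actual workflow in this post may wrap an equivalent call differently (for example, in a shell action).

```python
import requests

DOMAIN = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                          # placeholder access token

# Submit a one-time notebook run on a new job cluster (Jobs API 2.0 runs/submit)
payload = {
    "run_name": "audit-etl",
    "new_cluster": {
        "spark_version": "<databricks-runtime-version>",  # placeholder runtime string
        "node_type_id": "<node-type>",                     # placeholder instance type
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Shared/audit_etl"},  # hypothetical notebook
}

resp = requests.post(
    f"{DOMAIN}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print("Submitted run_id:", resp.json()["run_id"])
```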
Below is the end-to-end approach to solving this problem using Apache Oozie and Databricks job clusters via the REST API supported in Databricks.
We will create an Oozie coordinator with an HDFS file dependency that triggers the actual Oozie workflow, which performs all the actions shown in the diagram above; a rough sketch of such a coordinator follows. Both are controlled by a job.properties file, which is used to configure the Oozie job on your on-premises Oozie server.
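As a rough illustration only, a coordinator with an HDFS file dependency typically looks like the snippet below. The dataset name, URI template, done-flag, and frequency are placeholders and not the exact configuration built later in this post; variables such as ${nameNode}, ${startTime}, and ${workflowAppUri} would come from the job.properties file.

```xml
<coordinator-app name="audit-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- Daily dataset directory the coordinator polls for the audit file -->
    <dataset name="auditInput" frequency="${coord:days(1)}"
             initial-instance="${startTime}" timezone="UTC">
      <uri-template>${nameNode}/data/audit/${YEAR}${MONTH}${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The workflow is triggered only once the current day's data is available -->
    <data-in name="auditReady" dataset="auditInput">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>
```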