2.0 Apache DistCp

DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copies, and it can also copy to Amazon S3. It uses MapReduce to handle distribution, error handling and recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list. With DistCp, a cluster with many nodes can copy large volumes of data quickly. DistCp is best suited to transfers of around 5 TB at speeds of up to 2 GB/s. It is also a secure way to transfer data, because it works with your on-premises security features such as Kerberos and SSL.

hadoop distcp -m 20 -mapredSslConf <ssl_conf_file> /user/apps/databricks/<your_data_directory> s3a://<bucket-name>/distcp_files/
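
Before launching a large copy, it can help to print the command for review and estimate how long the transfer might take. The following is a minimal sketch, not part of DistCp itself: the helper names `build_distcp_cmd` and `estimate_seconds`, the example directory, and the example bucket are all hypothetical. The estimate assumes decimal units (1 TB = 1000 GB) and a sustained rate, which real transfers may not reach.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints a DistCp command and a rough time estimate.
# It does not call hadoop. All paths and names here are hypothetical.

# Assemble the command with options placed before the source and destination.
build_distcp_cmd() {
  local maps="$1" src="$2" dest="$3"
  printf 'hadoop distcp -m %s %s %s\n' "$maps" "$src" "$dest"
}

# Rough transfer time in whole seconds: size in GB divided by rate in GB/s.
# Integer division, so any fractional second is dropped.
estimate_seconds() {
  local size_gb="$1" rate_gb_per_s="$2"
  echo $(( size_gb / rate_gb_per_s ))
}

# Example: 20 map tasks, a 5 TB (5000 GB) copy at the 2 GB/s upper bound.
build_distcp_cmd 20 /user/apps/example_dir s3a://example-bucket/distcp_files/
estimate_seconds 5000 2
```

At the 2 GB/s upper bound, 5 TB works out to 2500 seconds, or roughly 42 minutes. Lower sustained throughput lengthens this proportionally.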
