Crail can be used to increase performance or enhance flexibility in Apache Spark. We provide multiple plugins that allow Crail to be used either as an HDFS-compatible file system (via the Crail HDFS adapter) or as an acceleration layer for Spark's shuffle and broadcast operations (via Crail-Spark-IO).
The Crail HDFS adapter is provided with every Crail deployment. The HDFS adapter allows any HDFS path to be replaced with a path on Crail. However, for it to be used for input and output in Spark, the Crail jar file paths have to be added to the Spark configuration in spark-defaults.conf:
spark.driver.extraClassPath $CRAIL_HOME/jars/*
spark.executor.extraClassPath $CRAIL_HOME/jars/*
Data in Crail can be accessed by prepending the namenode address configured in crail-site.conf to any HDFS path. For example, crail://<namenode>:<port>/test refers to /test in Crail.
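Once the jars are on the classpath, such paths can be used directly from a Spark job. In the hypothetical snippet below, the namenode host and port (namenode:9060) and the output path are assumptions and must match your crail-site.conf:

```scala
// Hypothetical Spark snippet: host, port, and paths are assumptions.
// Reads /test from Crail and writes a derived dataset back to Crail.
val lines = spark.read.textFile("crail://namenode:9060/test")
lines.filter(_.nonEmpty)
  .write.text("crail://namenode:9060/test-nonempty")
```

Any Spark API that accepts an HDFS path should work the same way with a crail:// path.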
Note that Crail works independently of HDFS and does not interact with HDFS in
any way. However, Crail does not completely replace HDFS, since it does not offer
durability and fault tolerance (cf. Introduction).
A good fit for Crail is, for example, inter-job data that can be recomputed
from the original data in HDFS.
Crail-Spark-IO contains various I/O acceleration plugins for Spark tailored to high-performance network and storage hardware (RDMA, NVMef, etc.). Spark-IO is not provided with the default Crail deployment but can be obtained here. Spark-IO currently contains two I/O plugins: a shuffle engine and a broadcast module. Both plugins inherit all the benefits of Crail, such as very high performance (throughput and latency) and multi-tiering (e.g., DRAM and flash). The requirements for Spark-IO are:
- Spark >= 2.0
- Java 8
- Crail >= 1.0
To build Crail-Spark-IO, execute the following steps:
To configure the Crail shuffle plugin, add the following lines to spark-defaults.conf:
spark.shuffle.manager org.apache.spark.shuffle.crail.CrailShuffleManager
spark.driver.extraClassPath $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
spark.executor.extraClassPath $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
Unfortunately, since Spark version 2.0.0, broadcast is no longer an exchangeable plugin. To use the Crail broadcast plugin in Spark, it has to be added manually to Spark's BroadcastManager.scala.
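The exact modification is not reproduced in this document; a minimal sketch of what such an edit might look like, assuming the plugin provides a factory class named CrailBroadcastFactory (the class name and package are assumptions, not confirmed here), is:

```scala
// Sketch of a hypothetical edit inside Spark's BroadcastManager.scala.
// CrailBroadcastFactory is an assumed class name, not confirmed by this document.
private def initialize() {
  synchronized {
    if (!initialized) {
      // Instead of Spark's built-in TorrentBroadcastFactory, instantiate
      // the Crail broadcast factory so broadcast data flows through Crail.
      broadcastFactory = new org.apache.spark.broadcast.CrailBroadcastFactory
      broadcastFactory.initialize(isDriver, conf, securityManager)
      initialized = true
    }
  }
}
```

After such a change, Spark has to be rebuilt so that the modified BroadcastManager is picked up by the deployment.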