Spark

Crail can be used to increase performance or enhance flexibility in Apache Spark. We provide several plugins that allow Crail to be used with Spark; each is described in its own section below.

HDFS Adapter

The Crail HDFS adapter is provided with every Crail deployment. The adapter allows any HDFS path to be replaced with a path on Crail. To use it for input and output in Spark, however, the Crail jar files have to be added to the classpath in the Spark configuration file spark-defaults.conf:

spark.driver.extraClassPath      $CRAIL_HOME/jars/*
spark.executor.extraClassPath    $CRAIL_HOME/jars/*

Data in Crail can be accessed by prepending the value of crail.namenode.address from crail-site.conf to any HDFS path. For example, crail://localhost:9060/test accesses /test in Crail. Note that Crail works independently of HDFS and does not interact with it in any way. However, Crail does not completely replace HDFS, since it does not offer durability or fault tolerance (cf. Introduction). A good fit for Crail is, for example, inter-job data that can be recomputed from the original data in HDFS.
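
The path rule above can be sketched as a small helper. This is an illustration only: the function name to_crail_uri is hypothetical and not part of Crail, and the sketch assumes that crail.namenode.address has the form crail://host:port, as in the example above.

```python
def to_crail_uri(namenode_address: str, path: str) -> str:
    """Prepend the Crail namenode address to an absolute HDFS-style path.

    Hypothetical helper for illustration; not part of Crail itself.
    """
    if not namenode_address.startswith("crail://"):
        raise ValueError("expected an address of the form crail://host:port")
    if not path.startswith("/"):
        raise ValueError("expected an absolute path such as /test")
    # Drop a trailing slash on the address so the result has a single separator.
    return namenode_address.rstrip("/") + path


if __name__ == "__main__":
    # Address value assumed to come from crail.namenode.address in crail-site.conf.
    print(to_crail_uri("crail://localhost:9060", "/test"))
```

The resulting string can be passed to Spark wherever an HDFS path would otherwise be used, provided the classpath settings shown earlier are in place.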

Spark-IO

Crail-Spark-IO contains various I/O acceleration plugins for Spark tailored to high-performance network and storage hardware (RDMA, NVMe over Fabrics, etc.). Spark-IO is not included in the default Crail deployment and has to be obtained separately. Spark-IO currently contains two I/O plugins: a shuffle engine and a broadcast module. Both plugins inherit all the benefits of Crail, such as very high performance (throughput and latency) and multi-tiering (e.g., DRAM and flash).

Requirements

  • Spark >= 2.0
  • Java 8
  • Maven
  • Crail >= 1.0

Building

To build Crail-Spark-IO, execute the following steps:

  1. Obtain a copy of Crail-Spark-IO from GitHub
  2. Make sure your local Maven repository contains Crail; if it does not, build Crail from source
  3. Run: mvn -DskipTests install

Configure Spark

To configure the Crail shuffle plugin, add the following lines to spark-defaults.conf:

spark.shuffle.manager           org.apache.spark.shuffle.crail.CrailShuffleManager

spark.driver.extraClassPath     $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
spark.executor.extraClassPath   $CRAIL_HOME/jars/*:<path>/crail-spark-X.Y.jar:.
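
When spark-defaults.conf is generated by a deployment script, the three settings above can be rendered with concrete paths filled in. The following is a minimal sketch under that assumption: the function name crail_shuffle_conf and the example paths are hypothetical, and the jar file name keeps the X.Y placeholder from the text above.

```python
def crail_shuffle_conf(crail_home: str, plugin_jar: str) -> str:
    """Render the Crail shuffle settings as spark-defaults.conf lines.

    Hypothetical helper for illustration; not part of Crail or Spark.
    """
    # Same classpath layout as in the configuration shown above:
    # all Crail jars, then the Crail-Spark-IO jar, then the current directory.
    classpath = f"{crail_home}/jars/*:{plugin_jar}:."
    settings = [
        ("spark.shuffle.manager",
         "org.apache.spark.shuffle.crail.CrailShuffleManager"),
        ("spark.driver.extraClassPath", classpath),
        ("spark.executor.extraClassPath", classpath),
    ]
    # spark-defaults.conf separates each key from its value with whitespace.
    return "\n".join(f"{key} {value}" for key, value in settings)


if __name__ == "__main__":
    # Both paths are placeholders; substitute the locations of your installation.
    print(crail_shuffle_conf("/opt/crail", "/opt/jars/crail-spark-X.Y.jar"))
```

The driver and the executors receive the same classpath here, mirroring the two extraClassPath lines above.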

Unfortunately, since Spark version 2.0.0 broadcast is no longer an exchangeable plugin. To use the Crail broadcast plugin in Spark, it has to be added manually to Spark’s BroadcastManager.scala.

Crail-TeraSort

SQL