How to use the Spark Shell (REPL)
The Spark console is a great way to run Spark code on your local machine.
You can easily create a DataFrame and play around with code in the Spark console to avoid spinning up remote servers that cost money!
Starting the console
Download Spark and run the spark-shell executable to start the Spark console. Consoles like this are also known as read-eval-print loops (REPLs).
I store my Spark versions in the ~/Documents/spark directory, so I can start my Spark shell with a command like this (assuming a Spark 2.3.0 build, the same version used later in this post; adjust the path for your version):
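~/Documents/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-shell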
Important variables accessible in the console
The Spark console creates a sc variable to access the SparkContext and a spark variable to access the SparkSession.
You can use the spark variable to read a CSV file on your local machine into a DataFrame:
val df = spark.read.csv("/Users/powers/Documents/tmp/data/silly_file.csv")
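Once the DataFrame is loaded, you can inspect it right in the console (the file path above is specific to my machine, so point spark.read.csv at any CSV you have handy):

df.show()
df.printSchema()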
You can use the sc variable to convert a sequence of Row objects into an RDD:
import org.apache.spark.sql.Row

sc.parallelize(Seq(Row(1, 2, 3)))
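parallelize returns an RDD[Row]. As a quick sanity check (a minimal sketch; the values are arbitrary), you can assign the result to a variable and count its elements:

val rdd = sc.parallelize(Seq(Row(1, 2, 3)))
rdd.count() // returns 1, since the Seq contains a single Row with three fields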
The Spark console automatically runs import spark.implicits._ when it starts, so you have access to handy methods like toDF() and the shorthand $ syntax for creating column objects. We can easily create a column object like this (the column name below is just an example):
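$"some_column" // creates an org.apache.spark.sql.ColumnName; the name is arbitrary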
The :quit command stops the console.
The :paste command lets you add multiple lines of code at once. Here's an example:
scala> :paste
// Entering paste mode (ctrl-D to finish)

val y = 5
val x = 10
x + y

// Exiting paste mode, now interpreting.

y: Int = 5
x: Int = 10
res8: Int = 15
The :help command lists all the available console commands:
scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line>        edit history
:help [command]          print this summary or command-specific help
:history [num]           show the history (optional num is commands to show)
:h? <string>             search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v]          show the implicits in scope
:javap <path|class>      disassemble a file or class name
:line <id>|<line>        place line(s) at the end of history
:load <path>             interpret lines in a file
:paste [-raw] [path]     enter paste mode or paste a file
:power                   enable power user mode
:quit                    exit the interpreter
:replay [options]        reset the repl and replay all previous commands
:require <path>          add a jar to the classpath
:reset [options]         reset the repl to its initial state, forgetting all session entries
:save <path>             save replayable session to a file
:sh <command line>       run a shell command (result is implicitly => List[String])
:settings <options>      update compiler options, if possible; see reset
:silent                  disable/enable automatic printing of results
:type [-v] <expr>        display the type of an expression without evaluating it
:kind [-v] <expr>        display the kind of expression's type
:warnings                show the suppressed warnings from the most recent line which had any
This Stack Overflow answer contains a good description of the available console commands.
Starting the console with a JAR file
The Spark console can be started with a JAR file as follows:
~/Documents/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --jars ~/Downloads/spark-daria-2.3.0_0.24.0.jar
You can download the spark-daria JAR file from the release page if you'd like to try this yourself.
Let's access the EtlDefinition class in the console to make sure that the spark-daria namespace was successfully added to the console.
scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
res0: com.github.mrpowers.spark.daria.sql.EtlDefinition.type = EtlDefinition
You can add a JAR file to an existing console session with the :require command. For example, using the same spark-daria JAR as above (spell out the full path, since the REPL may not expand ~):
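:require /Users/powers/Downloads/spark-daria-2.3.0_0.24.0.jar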
The Spark console is a great way to play around with Spark code on your local machine.
Try reading the Introduction to Spark DataFrames post and pasting all the examples into a Spark console as you go. It'll be a great way to learn about the Spark console and DataFrames!