How to use the Spark Shell (REPL)

The Spark console is a great way to run Spark code on your local machine.

You can easily create a DataFrame and play around with code in the Spark console to avoid spinning up remote servers that cost money!

Starting the console

Download Spark and run the spark-shell executable command to start the Spark console. Consoles are also known as read-eval-print loops (REPL).

I store my Spark versions in the ~/Documents/spark directory, so I can start my Spark shell with this command.

bash ~/Documents/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-shell

Important variables accessible in the console

The Spark console creates a sc variable to access the SparkContext and a spark variable to access the SparkSession.

You can use the spark variable to read a CSV file on your local machine into a DataFrame.

val df = spark.read.csv("/Users/powers/Documents/tmp/data/silly_file.csv")

You can use the sc variable to convert a sequence of Row objects into a RDD:

import org.apache.spark.sql.Row

sc.parallelize(Seq(Row(1, 2, 3)))

The Spark console automatically runs import spark.implicits._ when it starts, so you have access to handy methods like toDF() and the shorthand $ syntax to create column objects. We can easily create a column object like this: $"some_column_name".

Console commands

The :quit command stops the console.

The :paste lets the user add multiple lines of code at once. Here's an example:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val y = 5
val x = 10
x + y

// Exiting paste mode, now interpreting.

y: Int = 5
x: Int = 10
res8: Int = 15

The :help command lists all the available console commands. Here's a full list of all the console commands.

scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:edit <id>|<line>        edit history
:help [command]          print this summary or command-specific help
:history [num]           show the history (optional num is commands to show)
:h? <string>             search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v]          show the implicits in scope
:javap <path|class>      disassemble a file or class name
:line <id>|<line>        place line(s) at the end of history
:load <path>             interpret lines in a file
:paste [-raw] [path]     enter paste mode or paste a file
:power                   enable power user mode
:quit                    exit the interpreter
:replay [options]        reset the repl and replay all previous commands
:require <path>          add a jar to the classpath
:reset [options]         reset the repl to its initial state, forgetting all session entries
:save <path>             save replayable session to a file
:sh <command line>       run a shell command (result is implicitly => List[String])
:settings <options>      update compiler options, if possible; see reset
:silent                  disable/enable automatic printing of results
:type [-v] <expr>        display the type of an expression without evaluating it
:kind [-v] <expr>        display the kind of expression's type
:warnings                show the suppressed warnings from the most recent line which had any

This Stackoverflow answer contains a good description of the available console commands.

Starting the console with a JAR file

The Spark console can be initiated with a JAR files as follows:

bash ~/Documents/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --jars ~/Downloads/spark-daria-2.3.0_0.24.0.jar

You can download the spark-daria JAR file on this release page if you'd like to try for yourself.

Let's access the EtlDefinition class in the console to make sure that the spark-daria namespace was successfully added to the console.

scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
res0: com.github.mrpowers.spark.daria.sql.EtlDefinition.type = EtlDefinition

You can add a JAR file to an existing console session with the :require command.

:require /Users/powers/Downloads/spark-daria-2.3.0_0.24.0.jar

Next steps

The Spark console is a great way to play around with Spark code on your local machine.

Try reading the Introdution to Spark DataFrames post and pasting in all the examples to a Spark console as you go. It'll be a great way to learn about the Spark console and DataFrames!