January 09, 2019
Dealing with null in Spark
Spark DataFrames are filled with null values, and Spark's best practices for null differ from those of standard programming languages. This post explains how to deal with null in Spark and avoid the dreaded NullPointerException.

January 07, 2019
Emerging Markets Value ETFs
The spread between emerging market value and emerging market growth is the widest it's ever been, and Rob Arnott says emerging market value stocks are a buy. This post looks at options for retail investors.

January 04, 2019
Rob Arnott December 2018 Podcast
Rob Arnott gave an excellent podcast discussing emerging markets, factor investing, performance chasing, and stock buybacks. This post recaps Arnott's podcast and provides actionable next steps.

December 22, 2018
Working with dates and times in Spark
This post shows how to create DataFrames with dates and times and how to leverage Spark's extensive API that makes working with dates / times easy.

December 10, 2018
AWS Athena and Apache Spark are Best Friends
AWS Athena and Apache Spark complement each other perfectly. This post explains how to build data lakes with Spark that are optimized for Athena queries.

December 09, 2018
Optimizing Data Lakes for Apache Spark
This post explains the file formats, compression algorithms, file sizes, and partitioning schemes for creating data lakes that are fast to query with Apache Spark.

October 23, 2018
Fast Filtering with Spark PartitionFilters and PushedFilters
Spark can use disk partitioning to greatly improve the speed of filtering operations. This post explains how to partition files on disk with partitionBy and how to dissect physical plans.

October 21, 2018
Compacting Files
Spark runs slowly when it uses a data store that has a lot of small files. This post explains how to compact small files into bigger files so your Spark code can run faster.

October 17, 2018
Broadcast Joins
Broadcast joins are perfect for joining a large DataFrame with a small DataFrame. This post explains how to perform broadcast joins and how to investigate physical plans to optimize performance.

October 06, 2018
Deduplicating and Collapsing Records in Spark DataFrames
Data lakes often contain records that should be deduplicated or collapsed to a single row. This blog post covers deduplication functions and advanced Spark features that make collapsing records easy.

September 29, 2018
Just Enough Scala for Spark Programmers
Spark programmers only need to know a small subset of the Scala API to be productive. This blog post covers the minimal subset of Scala features that Spark programmers need to know.

September 23, 2018
Shading Dependencies in Spark Projects with SBT
SBT allows you to shade dependencies in fat JAR files. This blog post describes when shading is useful and shows you how to shade library dependencies in your Spark projects.

September 19, 2018
Introduction to Spark SQL Functions
Spark provides a lot of SQL functions that make it easy to perform a variety of operations on DataFrames. This blog post explains how to use common SQL functions and how to write your own SQL functions.

September 17, 2018
Using the Spark Shell
The Spark console lets you run Spark code on your local machine. This post explains how to start the console, run key commands, and require JAR files.

September 09, 2018
Introduction to Spark DataFrames
DataFrames are the foundation for most of the analyses you'll perform with Spark. This post will discuss creating DataFrames, defining schemas, adding columns, and filtering rows.

September 01, 2018
Limiting Order Dependencies in Spark Functions
Spark transformations often depend on columns that are added by other transformations, thus creating an order dependency. This post will discuss tactics that reduce order dependencies in transformation libraries.

July 07, 2018
Incrementally Updating Extracts with Spark
Spark Structured Streaming and Trigger.Once allow you to create data lake extracts that are automatically partitioned and update incrementally. This technology can save you a lot of time and money.

May 21, 2018
Publishing Spark Projects with JitPack
JitPack is a package repository that provides easy access to Spark projects and is a great alternative to Maven. This episode will show you how to publish both public and private projects in JitPack.

May 15, 2018
Running Logistic Regressions with Spark
Logistic regression models are a powerful way to predict binary outcomes. This episode will demonstrate how to train a logistic regression model, test the model, and assess the accuracy of the model.

May 08, 2018
Environment Specific Config in Spark Scala Projects
Environment config files return different values for the test, development, staging, and production environments. This episode will show you how to add environment configuration to your Spark projects.

May 02, 2018
Building Spark JAR Files with SBT
Spark JAR files let you package code in a GitHub repository and run it on a cluster. This episode will demonstrate how to build JAR files with SBT and how to customize the code that's included in JAR files.

April 06, 2018
Advanced String Matching with Spark's rlike Method
The Spark rlike method allows for powerful SQL REGEXP pattern matching. This episode shows how to make simple rlike matches and then dives into techniques for defining multiple match criteria in a CSV file.

April 04, 2018
Speaking Slack Notifications from Spark
Speaking Slack notifications from Spark applications is a great way to keep stakeholders up-to-date with important notifications. This blog post also shows how to run Spark jobs via Slack commands.

April 02, 2018
Testing Spark Applications with uTest
uTest is the 'essential test framework' for the Scala programming language and provides an elegant interface for writing tests. This blog post shows how to test Spark functions with the uTest framework.