Presto is Facebook’s open source SQL query engine. One of its cool features is that it can combine data from multiple sources, such as relational stores, HDFS, Cassandra, and even streams like Kafka, in a single join query. This can be useful if you want to run a Spark job on data that sits in multiple places. Instead of creating separate RDDs for each data source and joining them in memory with Spark, you can create a logical view in Presto that represents your combined dataset. Then you can point your Spark code at that view and pull your data from one place.
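For example, a logical view that joins a table in Hive with a table in Cassandra might look something like this in Presto (the catalog, schema, table, and column names here are all made up):

```sql
-- Hypothetical view combining a Hive table with a Cassandra table;
-- every name below is a placeholder for your own setup.
CREATE VIEW hive.default.combined_events AS
SELECT e.user_id, e.event_time, p.display_name
FROM hive.default.events e
JOIN cassandra.profiles.users p ON e.user_id = p.user_id;
```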
Let’s say you want to create a DataFrame in Spark using a connection to Presto. First, download Presto’s JDBC driver, which you can get here. (If you’re compiling a JAR, add Presto as a dependency instead.)
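If you go the dependency route, the driver is published as com.facebook.presto:presto-jdbc; in sbt that looks roughly like the following (the version is a placeholder, so use whatever matches your Presto cluster):

```scala
// build.sbt — the version below is a placeholder
libraryDependencies += "com.facebook.presto" % "presto-jdbc" % "0.100"
```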
The Spark guide shows you how to load a DataFrame directly from a database table, or refers you to JdbcRDD. Neither option will work: Presto’s JDBC driver does not implement prepareStatement, so both approaches fail with an error.
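For reference, the standard JDBC read path that trips over this looks roughly like the following (the URL, catalog, schema, and table are placeholders):

```scala
// The documented approach for other databases; against Presto it fails
// because Spark calls prepareStatement to resolve the schema.
val df = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url"     -> "jdbc:presto://presto-coordinator:8080/hive/default",
    "dbtable" -> "my_table",
    "driver"  -> "com.facebook.presto.jdbc.PrestoDriver"))
  .load()
```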
Are there any alternatives? We’ll have to build an RDD manually and convert it into a DataFrame using the JDBC driver. The disadvantage of this approach is that the data has to fit in memory on the driver before it can be parallelized across many machines.
Let’s try this approach. Run Spark shell with Presto’s JDBC driver.
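Something along these lines works, with the jar name depending on the driver version you downloaded:

```
spark-shell --jars presto-jdbc-0.100.jar --driver-class-path presto-jdbc-0.100.jar
```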
In Spark shell, we need to import the following.
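For the sketch that follows, these cover the JDBC classes, Spark’s Row and schema types, and a mutable buffer for the results:

```scala
import java.sql.{DriverManager, Types}
import java.util.Properties
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer
```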
Setting up the connection from here is the same for any JDBC driver.
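For example, assuming a coordinator at presto-coordinator:8080 and the hive catalog’s default schema (all placeholders for your own cluster):

```scala
// Presto requires a user name on the connection; a password isn't needed here.
val url = "jdbc:presto://presto-coordinator:8080/hive/default"
Class.forName("com.facebook.presto.jdbc.PrestoDriver")

val props = new Properties()
props.setProperty("user", "spark")

val conn = DriverManager.getConnection(url, props)
val stmt = conn.createStatement()
```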
It’s worth noting that even if you specify a catalog and schema here for the connection, you can still query other catalogs and schemas using the fully qualified name of the table (e.g. catalog.schema.my_table), which is a way to hit your logical view.
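So the query used for the rest of this post can just point at a fully qualified name, such as the hypothetical view from earlier:

```scala
// Fully qualified, so it resolves regardless of the catalog/schema in the URL.
val query = "SELECT * FROM hive.default.combined_events"
```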
We’re going to grab the schema of the table first. Then we’ll build a buffer to hold our data and then parallelize it across machines to create the RDD.
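A sketch of that, mapping just VARCHAR, BIGINT, DOUBLE, and BOOLEAN columns (the variable names are my own):

```scala
// Execute the query once and build a Spark schema from the JDBC metadata.
val rs = stmt.executeQuery(query)
val meta = rs.getMetaData
val columnCount = meta.getColumnCount

// Only four JDBC types are mapped; anything else will hit a MatchError
// until you add a case for it.
def sparkType(jdbcType: Int): DataType = jdbcType match {
  case Types.VARCHAR => StringType
  case Types.BIGINT  => LongType
  case Types.DOUBLE  => DoubleType
  case Types.BOOLEAN => BooleanType
}

val schema = StructType((1 to columnCount).map { i =>
  StructField(meta.getColumnName(i), sparkType(meta.getColumnType(i)), nullable = true)
})

// Read every row into a buffer on the driver (this is the part that must fit
// in memory), then parallelize it into an RDD of Rows.
val buffer = new ArrayBuffer[Row]()
while (rs.next()) {
  val values = (1 to columnCount).map { i =>
    meta.getColumnType(i) match {
      case Types.VARCHAR => rs.getString(i)
      case Types.BIGINT  => rs.getLong(i)
      case Types.DOUBLE  => rs.getDouble(i)
      case Types.BOOLEAN => rs.getBoolean(i)
    }
  }
  buffer += Row.fromSeq(values)
}
rs.close()

val rowRDD = sc.parallelize(buffer)
```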
Note that in this code we’re only matching against four different types of data. You can add additional ones from here.
Now we can build the DataFrame using the schema and the RDD, register it as a temporary table, and execute queries.
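Putting it together (the temp table name is arbitrary):

```scala
val df = sqlContext.createDataFrame(rowRDD, schema)
df.registerTempTable("presto_data")

sqlContext.sql("SELECT COUNT(*) FROM presto_data").show()

conn.close()
```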