What is Apache Drill ?
Schema-free SQL Query Engine for Hadoop, NoSQL and cloud storage.
It is a distributed processing engine that interprets SQL. Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. ( Source – https://drill.apache.org/)
With Drill users can just query the raw data from its original place. There isn’t a need to load the data, create and maintain schemas, or transform the data before it can be processed.
Why Drill?
- You can get started in minutes
- Schema free JSON model . No ETL needed
- Query data from its original place
- Support for ANSI SQL
- Support for multiple datasources
- Extremely user and developer friendly
- Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Installation (Mac OSX)
First things first. Lets install Apache Drill “Embedded Mode”. This requires almost no setup configurations.
Download apache-drill-1.5.0.tar.gz from one of the mirror sites :
- http://www.apache.org/dyn/closer.cgi/drill/drill-1.5.0/apache-drill-1.5.0.tar.gz
You can also download it using curl or wget :
wget http://mirrors.sonic.net/apache/drill/drill-1.5.0/apache-drill-1.5.0.tar.gz
curl -o apache-drill-1.5.0.tar.gz http://mirrors.sonic.net/apache/drill/drill-1.5.0/apache-drill-1.5.0.tar.gz
Ensure that you have JAVA_HOME set (1.7 is recommended)
Unzip and untar the downloaded file and go to bin folder :
cd /<PATH-TO>/apache-drill-1.5.0/bin
Now we start the drill shell
./drill-embedded
Easy setup as we can see. Now that we have the shell running lets run some SQL queries on a json file.
I have a some sample jsons which can be downloaded from: https://github.com/nizantz/drill-example-jsons.git
The files are placed in the sample-data folder of drill
Nishants-MacBook-Pro:sample-data nishant$ pwd /Users/nishant/drill/apache-drill-1.5.0/sample-data Nishants-MacBook-Pro:sample-data nishant$ ls -lrt drwxr-xr-x@ 3 nishant staff 102 Feb 9 15:23 regionsSF drwxr-xr-x@ 3 nishant staff 102 Feb 9 15:23 regionsMF -rwxr-xr-x@ 1 nishant staff 455 Feb 9 15:23 region.parquet drwxr-xr-x@ 3 nishant staff 102 Feb 9 15:23 nationsSF drwxr-xr-x@ 3 nishant staff 102 Feb 9 15:23 nationsMF -rwxr-xr-x@ 1 nishant staff 1210 Feb 9 15:23 nation.parquet -rw-r--r-- 1 nishant staff 487 Mar 1 13:49 positions.json -rw-r--r-- 1 nishant staff 510 Mar 1 13:49 employee.json -rw-r--r-- 1 nishant staff 160 Mar 1 13:49 departments.json -rw-r--r--@ 1 nishant staff 36 Mar 1 13:49 departments.csv Nishants-MacBook-Pro:sample-data nishant$
From the drill shell , lets query the employee.json :
SELECT * FROM dfs.`/Users/nishant/drill/apache-drill-1.5.0/sample-data/employee.json`;
From the drill shell , lets query the positions.json file:
SELECT * FROM dfs.`/Users/nishant/drill/apache-drill-1.5.0/sample-data/positions.json`;
How about getting the employee names and roles :
SELECT emp.firstName,emp.lastName, pos.roleTitle FROM dfs.`/Users/nishant/drill/apache-drill-1.5.0/sample-data/employee.json` emp , dfs.`/Users/nishant/drill/apache-drill-1.5.0/sample-data/positions.json` pos WHERE emp.roleId = pos.roleId;
So we could fire a simple join SQL between data in different JSON files.
How about querying two different datasources, like say JSON and MongoDB.
Drill supports MongoDB 3.0, providing a mongodb storage plugin to connect to MongoDB using MongoDB’s latest Java driver.
Am assuming that you already have a MongoDB setup. If no please follow instructions at https://docs.mongodb.org/manual/installation/ to install
Since Drill shell is already running, open the following URL in browser : http://localhost:8047/storage and click on “Enable” button for mongo
In Drill shell , we now will use the mongo.TEST schema. I have a departments table in the mongo schema;
Lets see the data in the departments table in MongoDB:
Now we can query the employee JSON file to get the employee names and join with departments table in Mongo to get the department names:
SELECT a.firstName||' '||a.lastName,b.deptName FROM dfs.`/Users/nishant/drill/apache-drill-1.5.0/sample-data/employee.json` a,departments b WHERE a.deptId=b.deptId;
So here we get the employee name from JSON and department name from MongoDB from a single SQL query. Isn’t that cool.
So now you can use Drill to query JSONs , join JSONs, join JSON and MongoDB.
Get going on drilling data!!
More later 🙂
[…] Singh explains how to get started with Apache […]