Rumble 1.1 -- switched to DataFrames, twice as fast
August 8, 2019
Submitted by Ghislain Fourny.
We are happy to announce the release of Rumble 1.1 beta, the JSONiq engine that queries heterogeneous and nested JSON data on top of Apache Spark.
Until version 1.0, FLWOR expressions were mapped to Spark RDDs.
We have now remapped FLWOR expressions to Spark DataFrames while fully preserving support for heterogeneous data. The result is Rumble 1.1, with a notable performance improvement: grouping and sorting are twice as fast.
From the user's perspective, nothing changes — except the speed.
The following examples show how the JSONiq syntax (95% inherited from XQuery) is as compact as SQL, yet seamlessly deals with heterogeneous and nested data:
1. How many persons in my dataset?
count(json-file("persons.json"))
2. What are all the cities they come from?
distinct-values(json-file("persons.json").addresses[].city)
3. How many persons in each country?
for $i in json-file("persons.json")
group by $c := $i.country
return {
"Country" : $c,
"Number" : count($i)
}
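Since sorting is the other operation that benefits from the move to DataFrames, here is a minimal sketch of a variant of query 3 that also sorts the groups by size. It is an illustration only: it assumes the same "persons.json" file and "country" field as above, and the variable $n is introduced here for convenience.

```jsoniq
(: Sketch: count persons per country, largest groups first. :)
(: Assumes each object in persons.json may carry a "country" field. :)
for $i in json-file("persons.json")
group by $c := $i.country
(: After grouping, $i is bound to all persons of the current group. :)
let $n := count($i)
order by $n descending
return {
  "Country" : $c,
  "Number" : $n
}
```

Binding the count to a variable with let keeps the order by and return clauses in agreement without computing it twice.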
If you want to try it out (no cluster needed: it also spreads computation across your local cores), you can download it for free (it is open source) here:
http://rumbledb.org/