Manual Parquet generation with Kite SDK
Installation:
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.1.0/kite-tools-1.1.0-binary.jar -o kite-dataset
chmod +x kite-dataset
Create the files' schema and save it as schema.txt:
{
  "type" : "record",
  "name" : "GenericCell4gHdfs",
  "namespace" : "com.mydomain.mypojo",
  "doc" : "Sample object",
  "fields" : [ {
    "name" : "timestamp",
    "type" : "string"
  }, {
    "name" : "value",
    "type" : [ "null", "double" ],
    "default" : null
  }, {
    "name" : "longValue",
    "type" : "long"
  }, {
    "name" : "intValue",
    "type" : "int"
  } ]
}
Create the partitions' definition. The partition strategy is this JSON:
[ {
  "name" : "year",
  "source" : "timestamp",
  "type" : "year"
}, {
  "name" : "month",
  "source" : "timestamp",
  "type" : "month"
}, {
  "name" : "day",
  "source" : "timestamp",
  "type" : "day"
} ]
It does not have to be written by hand. This command builds it from the schema and saves it as partitions.txt:
kite-dataset partition-config timestamp:year timestamp:month timestamp:day -s schema.txt -o partitions.txt
Create the dataset, using the schema and partition files from the previous steps:
kite-dataset create dataset:hdfs:/victor/mypojo --schema schema.txt --format parquet -p partitions.txt
Finally, import the data from a CSV file:
kite-dataset csv-import bigdata.csv dataset:hdfs:/victor/mypojo --use-hdfs
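For reference, the input file for the import step might look like the sketch below. It is only an illustration: the rows are made-up sample data, the timestamps are assumed to be epoch milliseconds, and it assumes (as Kite's CSV import normally expects) that the first line is a header whose column names match the field names in the schema.

```shell
# Write a small sample CSV. The header mirrors the schema's field names;
# the rows are invented values, with timestamps assumed to be epoch milliseconds.
cat > bigdata.csv <<'EOF'
timestamp,value,longValue,intValue
1420070400000,12.5,1000000,1
1420156800000,7.25,2000000,2
EOF

# Sanity check before importing: every line should have exactly four fields.
awk -F, 'NF != 4 { bad = 1 } END { exit bad + 0 }' bigdata.csv \
  && echo "bigdata.csv has a consistent column count"
```

Checking the column count up front is cheap and catches truncated or misquoted lines before they reach the import.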