BucketingSink capabilities for DataSet API

Rafi Aroch
Hi,

I'm writing a batch job which reads Parquet, does some aggregations, and writes back as Parquet files.
I would like the output to be partitioned by year, month, and day of the event time, similar to the functionality of the BucketingSink.

I was able to achieve reading from and writing to Parquet by using the hadoop-compatibility features.
However, I couldn't find a way to partition the data by year, month, and day to create a folder hierarchy accordingly; everything is written to a single directory.
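
For context, a minimal sketch of that hadoop-compatibility route, assuming Avro GenericRecord as the Parquet record type and a parquet-avro version where AvroParquetInputFormat/AvroParquetOutputFormat are generic (the paths and the schema string are placeholders):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetBatchJob {

  static final String SCHEMA_JSON = "..."; // placeholder: your Avro schema

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Read Parquet as (Void, GenericRecord) pairs through the Hadoop input format.
    Job inputJob = Job.getInstance();
    FileInputFormat.addInputPath(inputJob, new Path("hdfs:///data/in")); // placeholder
    DataSet<Tuple2<Void, GenericRecord>> input = env.createInput(
        new HadoopInputFormat<>(new AvroParquetInputFormat<GenericRecord>(),
            Void.class, GenericRecord.class, inputJob));

    DataSet<Tuple2<Void, GenericRecord>> result = input; // aggregations go here

    // Write back as Parquet. Note that everything lands in a single directory.
    Job outputJob = Job.getInstance();
    AvroParquetOutputFormat.setSchema(outputJob, new Schema.Parser().parse(SCHEMA_JSON));
    FileOutputFormat.setOutputPath(outputJob, new Path("hdfs:///data/out")); // placeholder
    result.output(
        new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), outputJob));

    env.execute("parquet-batch");
  }
}

The HadoopInputFormat/HadoopOutputFormat wrappers are what make the mapreduce Parquet formats usable from the DataSet API.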


Can anyone suggest a way to achieve this? Maybe there's a way to integrate the BucketingSink with the DataSet API, or another solution?

Rafi

Re: BucketingSink capabilities for DataSet API

Andrey Zagrebin
Hi Rafi,

At the moment I do not see any support for Parquet in the DataSet API other than the HadoopOutputFormat mentioned in the Stack Overflow question. I have cc’ed Fabian and Aljoscha; maybe they can provide more information.
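
For illustration, one possible workaround sketch (untested, and not a BucketingSink integration): derive a date string from each record's event time and attach one filtered Hadoop Parquet sink per date, so each partition gets its own year/month/day directory. The extractEventDate() helper, the "eventDate" field name, and the hard-coded list of dates below are placeholders:

import java.util.Arrays;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class PartitionedParquetWrite {

  // Placeholder: derive "yyyy/MM/dd" from the record's event-time field.
  static String extractEventDate(GenericRecord record) {
    return record.get("eventDate").toString(); // hypothetical field name
  }

  static void writePartitioned(ExecutionEnvironment env,
                               DataSet<Tuple2<Void, GenericRecord>> result,
                               Schema schema, String basePath) throws Exception {
    // Placeholder: in practice the dates could come from job parameters
    // or from a prior distinct() pass over the data.
    List<String> dates = Arrays.asList("2018/10/24", "2018/10/25");

    for (String date : dates) {
      Job job = Job.getInstance();
      AvroParquetOutputFormat.setSchema(job, schema);
      FileOutputFormat.setOutputPath(job, new Path(basePath + "/" + date));
      result
          .filter(t -> extractEventDate(t.f1).equals(date)) // keep this date's records
          .output(new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job));
    }
    env.execute("partitioned-parquet-write");
  }
}

Every record passes through every filter, so this only stays reasonable for a small number of partitions, and the set of dates has to be known when the plan is built. A custom Flink OutputFormat that routes each record to a per-bucket Parquet writer would avoid both limitations, but would have to be written by hand.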

Best,
Andrey
