Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions; for help and troubleshooting questions, please use the Google Cloud support and community channels instead.

We can’t wait to hear from you!

  1. 3 votes
  2. Show avg/median per-element processing time in monitoring interface

    When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.

    It would be nice to see the per-element processing time as well: either a simple average or, better yet, a histogram. This would make performance monitoring much easier (a user-level workaround is sketched below).

    3 votes
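    Until such a view exists, one rough workaround is to record the timing yourself with a Beam Distribution metric and read it back from the job's metrics afterwards. This is only an illustrative sketch, not part of the request; the class and metric names are made up.

    import org.apache.beam.sdk.metrics.Distribution;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    // Records how long each element takes inside processElement. The resulting
    // distribution (count, min, max, mean) can be queried from the pipeline
    // result's metrics once the job has run.
    class TimedFn extends DoFn<String, String> {
      private final Distribution elementMillis =
          Metrics.distribution(TimedFn.class, "element_processing_millis");

      @ProcessElement
      public void processElement(ProcessContext ctx) {
        long start = System.currentTimeMillis();
        ctx.output(doExpensiveWork(ctx.element()));  // hypothetical per-element work
        elementMillis.update(System.currentTimeMillis() - start);
      }

      private static String doExpensiveWork(String in) {
        return in.toUpperCase();
      }
    }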
  3. Support MapState from Beam API

    Currently the DataflowRunner does not support MapState for stateful DoFns (a sketch of what this would enable is below).

    3 votes
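    For context, here is a minimal sketch (with made-up names) of the kind of stateful DoFn that MapState support would enable; MapState, StateSpecs.map and @StateId are part of the Beam Java state API.

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.state.MapState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    class CountPerAttributeFn extends DoFn<KV<String, String>, KV<String, Long>> {

      // Per-key map state: attribute name -> running count.
      @StateId("counts")
      private final StateSpec<MapState<String, Long>> countsSpec =
          StateSpecs.map(StringUtf8Coder.of(), VarLongCoder.of());

      @ProcessElement
      public void processElement(
          ProcessContext ctx, @StateId("counts") MapState<String, Long> counts) {
        String attribute = ctx.element().getValue();
        Long current = counts.get(attribute).read();   // null if not yet present
        long updated = (current == null ? 0L : current) + 1;
        counts.put(attribute, updated);
        ctx.output(KV.of(attribute, updated));
      }
    }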
  4. Gradle example project

    I would love an example Gradle project. Gradle is of course very popular in the Java community, so it is odd that there is only documentation on how to use Dataflow with Maven (a minimal sketch is below).

    3 votes
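    In the meantime, a minimal build.gradle along these lines should work for a Beam pipeline targeting Dataflow. The versions and main class below are placeholders, not officially documented values.

    plugins {
        id 'java'
        id 'application'
    }

    repositories {
        mavenCentral()
    }

    dependencies {
        // Beam SDK plus the Dataflow runner; pin versions to match your pipeline.
        implementation 'org.apache.beam:beam-sdks-java-core:2.16.0'
        runtimeOnly 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.16.0'
        runtimeOnly 'org.slf4j:slf4j-simple:1.7.28'
    }

    // Placeholder main class for the pipeline.
    mainClassName = 'com.example.MyPipeline'

    // Example launch (the flags are ordinary Dataflow pipeline options):
    // ./gradlew run --args='--runner=DataflowRunner --project=my-project --stagingLocation=gs://my-bucket/staging'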
  5. Display execution parameters on Job info screen

    On the main info screen for a particular job, a tab for execution parameters would be very useful for debugging and quantifying job performance.

    Pretty much the whole suite of:

    --input
    --stagingLocation
    --dataflowJobFile
    --maxNumWorkers
    --numWorkers
    --diskSizeGb
    --jobName
    --autoscalingAlgorithm

    that Dataflow supports as execution parameters would be great to have to the right of "Step", on a tab called "Job" (a workaround for logging them at launch is sketched below).

    3 votes
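    Until the UI shows them, one workaround is to log the resolved options at launch time so they at least end up alongside the job. A rough sketch, using the accessor names on DataflowPipelineOptions:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class LogOptions {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

        // Print the execution parameters that the monitoring UI does not surface.
        System.out.println("jobName=" + options.getJobName()
            + " numWorkers=" + options.getNumWorkers()
            + " maxNumWorkers=" + options.getMaxNumWorkers()
            + " diskSizeGb=" + options.getDiskSizeGb()
            + " autoscalingAlgorithm=" + options.getAutoscalingAlgorithm());
      }
    }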
  6. 2 votes
  7. Pin products on console drawer by project

    Today, when you pin a product (e.g. Google BigQuery) on the console drawer, it is pinned independently of the selected Google Cloud project. Since different projects may use different sets of products, it would be nice if pinned products were scoped by project.

    2 votes
  8. JavaScript

    It would be great to add JavaScript as a supported pipeline language in addition to Python and Java.

    2 votes
  9. 1 vote
  10. Codegen

    I would like to take a valid job and generate at least some of the code needed to re-create it in Python using the client API. I want to be able to do this for my historic jobs as well.

    1 vote
  11. Add numWorkers and autoscalingAlgorithm parameters to Google-provided templates

    I'm using "Datastore to GCS text" provided Template but with high load (~8M entities) dataflow takes too long to scale. If we can provide numWorkers and autoscalingAlgorithm parameters jobs will take less time to execute.

    1 vote
  12. BUG: google-cloud-firestore not working with apache_beam[gcp] 2.16.0, but works with 2.13.0

    Running

    python pipeline --runner DataflowRunner ... --requirements_file requirements.txt

    throws an error when requirements.txt contains

    apache_beam[gcp]==2.16.0
    google-cloud-firestore

    but not when it contains

    apache_beam[gcp]==2.13.0
    google-cloud-firestore

    Part of error:

    ModuleNotFoundError: No module named 'Cython'
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/h4/n9rzy8z52lqdh7sfkhr96nnw0000gn/T/pip-download-lx28dwpv/pyarrow/

    See also: https://stackoverflow.com/questions/57286517/importing-google-firestore-python-client-in-apache-beam

    Where to report bugs?

    1 vote
  13. Read a file from a bucket, load it into a BigQuery table, then run 200 queries on it sequentially

    I have to read a file from a bucket, load that file's data into a BigQuery table, and then run 200 queries against that table. The whole process needs to happen one step at a time, but because the work runs in parallel it does not complete correctly, so I need to synchronize it so that one job finishes before the next is triggered (a rough sketch of one approach is below).
    Can anyone help me?

    1 vote
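    This is really a usage question rather than a feature request, but a sketch of the first two steps (read from GCS, load into BigQuery) could look like the following. The bucket, table, schema and CSV layout are made up, and blocking on waitUntilFinish() before firing the 200 queries is one simple way to get the sequential behaviour described.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;

    public class LoadThenQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("name").setType("STRING"),
            new TableFieldSchema().setName("value").setType("STRING")));

        p.apply("ReadFromGCS", TextIO.read().from("gs://my-bucket/input/*.csv"))
         .apply("ToTableRow", MapElements.via(new SimpleFunction<String, TableRow>() {
             @Override
             public TableRow apply(String line) {
               String[] f = line.split(",");
               return new TableRow().set("name", f[0]).set("value", f[1]);
             }
           }))
         .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
             .to("my-project:my_dataset.my_table")
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

        // Block until the load job is done, then run the 200 queries (e.g. with
        // the BigQuery client library) so each step starts only after the last.
        p.run().waitUntilFinish();
      }
    }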
  14. Does your worker machine type affect scheduling?

    Please update the docs to describe how machine type affects jobs.

    If you have a serial pipeline and don't do any native threading in your DoFn, is an n1-standard-8 going to be any faster than an n1-standard-1?

    If you have parallel stages and set a maximum of 50 workers, will work get done faster on n1-standard-8 machines than on n1-standard-1 machines, i.e. will 400 cores be used for workers instead of 50?

    [For this discussion, please ignore that n1-standard-8 has more RAM, which may help groupBy.]

    1 vote
  15. Update the screenshots to show the current console

    The screenshots are old and outdated, so they cannot be used for troubleshooting; Permissions is no longer on the left, for example.

    The error log messages for permissions also fail to say which account is having access issues, so I have spent hours trying different random combinations and still have not managed to get even the most basic Dataflow examples to work.

    1 vote
  16. Painful and sometimes impossible to change pipeline options when using a template

    Not all parameters can be configured at runtime. The documentation should make this clear to users.

    For example, AWS credentials/regions are configured at template construction time. There is no way (at least that I have found) to change them at run time if I want to pass in credentials then.

    In another case, in order to accept dynamic BigQuery queries at run time, extra code is needed to override the default query. What confuses me is that a query/table has to be provided during template construction, otherwise BigQueryIO complains (see the sketch below).

    1 vote
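    For what it's worth, the mechanism that seems to need documenting here is the ValueProvider pattern: only options declared as ValueProvider can be resolved when a template is run, while plain options are baked in when the template is built. A minimal sketch, with made-up interface and option names:

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;

    public interface MyTemplateOptions extends PipelineOptions {
      // Resolvable when the template is executed, not when it is built.
      @Description("BigQuery query to run")
      ValueProvider<String> getQuery();
      void setQuery(ValueProvider<String> value);
    }

    // In the pipeline, pass the provider through without calling get() at
    // construction time, e.g.:
    //   BigQueryIO.readTableRows().fromQuery(options.getQuery())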
  17. I really need Python examples.

    For example: how to read a file from GCS and load it into BigQuery, with Python.

    1 vote
  18. It would be great if the Dataflow SDK worked with a newer version of Google Datastore than v1

    Whenever I want to use Dataflow together with Google Datastore, I have to use the old Datastore v1 API. With this version it seems I have to encode and decode entities of different kinds manually, by extracting the Values (and knowing the type of each value) and setting them on a new object. When I compare this with newer versions of Datastore or the Node.js client, handling Datastore objects there is a dream (Node.js just gives you the JSON representation). Would it be possible to retrieve entities by "selecting" a class type, like:
    MyObject mObj = entity.to(MyObject.class)
    or what would…

    1 vote
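    For readers who haven't hit this, the manual unpacking being described looks roughly like the sketch below, using the com.google.datastore.v1 classes that DatastoreIO currently returns; MyObject and its fields are made up.

    import com.google.datastore.v1.Entity;
    import com.google.datastore.v1.Value;

    class MyObject {
      String name;
      long count;
    }

    class MyObjectConverter {
      // Today every kind needs hand-written code like this; the idea above asks
      // for something closer to entity.to(MyObject.class).
      static MyObject fromEntity(Entity entity) {
        MyObject obj = new MyObject();
        Value name = entity.getPropertiesMap().get("name");
        Value count = entity.getPropertiesMap().get("count");
        obj.name = name.getStringValue();     // caller must know each value's type
        obj.count = count.getIntegerValue();
        return obj;
      }
    }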
  19. Deployment Manager template for Dataflow

    Please provide a GDM type for creating, updating, and deleting Dataflow templates. Currently there is no Deployment Manager support for Dataflow. Since Dataflow uses GCE and GCS under the hood, rich GDM support could be built and would add tremendous value; GDM would allow mass/concurrent deployment of Dataflow templates.

    1 vote
  20. Ability to Downscale Google Provided Templates

    We have a job for which the basic templated Dataflow job works fairly well, but so far we cannot see a way to make it use fewer machines. Our data ingestion is large and growing, but not yet extremely large. The three 4-vCPU, 15 GB RAM machines that are started to process our volume of data are very much overkill. I do not see any way to use these basic templates while also setting the max_workers setting.

    1 vote