Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help forums, look here:

We can’t wait to hear from you!

  1. Cannot specify diskSizeGb when launching a template

    When I create a template, it's possible to specify --diskSizeGb, but if it isn't specified at template creation time, it can't be passed as a parameter when launching the template.
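
    A rough sketch of the current behaviour, assuming a Java pipeline (the value 50 and the class names below are illustrative placeholders): diskSizeGb is a worker option read at template-creation time, so today it has to be baked into the template rather than supplied at launch.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Illustrative sketch: the only way to control diskSizeGb today is to set it
    // (via --diskSizeGb or in code, as below) when the template is built,
    // not when the template is launched.
    public class CreateTemplate {
      public static void main(String[] args) {
        DataflowPipelineWorkerPoolOptions options =
            PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(DataflowPipelineWorkerPoolOptions.class);
        options.setDiskSizeGb(50); // hypothetical value, fixed into the template
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline and run with --templateLocation to stage the template ...
        p.run();
      }
    }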

    4 votes · 0 comments
  2. Button: "copy dataflow job"

    I would like to be able to copy a Dataflow job so that I can tweak the parameters and run it again without having to enter them all manually.

    4 votes · 0 comments
  3. Log BigQuery Error

    Currently, when BigQueryIO.Write tries to insert something invalid in streaming mode, the log contains only the call stack, not the reason for the error (such as a wrong format or a wrong field name). A workaround sketch follows the log excerpt:
    exception: "java.lang.IllegalArgumentException: timeout value is negative

    at java.lang.Thread.sleep(Native Method)
    at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
    at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
    at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
    at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
    at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:696)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:94)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:521)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    "

    logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"

    stage: "F17"

    job: "2016-10-061514_44-8205894049986167886"

    3 votes · 0 comments
  4. Meaningful information about steps output collection in UI

    In the UI, when clicking on a step, showing the output collections' tag names (or OutputTag ids, when available) instead of "out" + index would be more meaningful.

    3 votes · 0 comments
  5. Add a more helpful error message to "projects.jobs.create"

    I'm trying to launch a batch pipeline from outside the project and the "projects.jobs.create" API is returning:

    {
      "error": {
        "code": 400,
        "message": "Request contains an invalid argument.",
        "status": "INVALID_ARGUMENT"
      }
    }

    There is no indication of which argument is invalid. This wouldn't be such a big deal, except that the documentation for this method (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/create?authuser=1) does not indicate which fields are required. :(

    3 votes · 0 comments
  6. Display execution parameters on Job info screen

    On the main info screen for a particular job, a tab for execution parameters would be very useful for debugging and quantifying job performance.

    Pretty much the whole suite of:

    --input
    --stagingLocation
    --dataflowJobFile
    --maxNumWorkers
    --numWorkers
    --diskSizeGb
    --jobName
    --autoscalingAlgorithm

    that Dataflow supports as execution parameters would be great to have to the right of "Step", on a tab called "Job". A logging workaround sketch is shown below.
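
    Until something like this exists in the UI, a possible workaround sketch (the option names are the standard ones listed above; the logging itself is illustrative) is to log the options at submission time so they are at least searchable in the job's logs:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class SubmitWithLoggedOptions {
      private static final Logger LOG = LoggerFactory.getLogger(SubmitWithLoggedOptions.class);

      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        // Echo the execution parameters into the job logs for later debugging.
        LOG.info("jobName={} numWorkers={} maxNumWorkers={} diskSizeGb={} stagingLocation={}",
            options.getJobName(), options.getNumWorkers(), options.getMaxNumWorkers(),
            options.getDiskSizeGb(), options.getStagingLocation());
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline ...
        p.run();
      }
    }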

    3 votes · 0 comments
  7. Gradle example project

    I would love an example Gradle project. Gradle is very popular in the Java community, so it is odd that there is only documentation on how to use Dataflow with Maven.
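
    A minimal sketch of what such an example might look like (the Beam version, slf4j binding, and main class below are placeholders, not an official recommendation):

    // build.gradle (illustrative)
    plugins {
        id 'java'
        id 'application'
    }

    repositories {
        mavenCentral()
    }

    dependencies {
        implementation 'org.apache.beam:beam-sdks-java-core:2.41.0'
        implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.41.0'
        implementation 'org.slf4j:slf4j-jdk14:1.7.36'
    }

    application {
        mainClass = 'com.example.MyPipeline'  // placeholder entry point
    }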

    3 votes · 0 comments
  8. Beam SDK for Python: support for Triggers

    The Beam Python SDK doesn't support triggers (or state, or timers).
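
    For reference, a rough sketch of the Java SDK triggering API that currently has no Python equivalent (the window size and firing delay are illustrative, and `events` is an assumed PCollection<String>):

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Emit early speculative panes every 30 seconds, then a final pane at the watermark.
    PCollection<String> windowed = events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());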

    3 votes · 0 comments
  9. Show avg/median per-element processing time in monitoring interface

    When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.

    It would be nice to be able to see the per-element processing time: either a simple average or, better yet, a histogram. This would make performance monitoring much easier.
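
    As a possible interim workaround (a sketch only; the metric name is made up, and the assumption that Dataflow surfaces Distribution metrics alongside custom counters is mine), per-element timing can be recorded from inside the DoFn:

    import org.apache.beam.sdk.metrics.Distribution;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    public class TimedFn extends DoFn<String, String> {
      // Distribution metrics report min/max/mean/count, which approximates a histogram.
      private final Distribution elementMillis =
          Metrics.distribution(TimedFn.class, "element_processing_millis");

      @ProcessElement
      public void processElement(ProcessContext c) {
        long start = System.currentTimeMillis();
        c.output(process(c.element()));
        elementMillis.update(System.currentTimeMillis() - start);
      }

      // Placeholder for the real per-element work being timed.
      private String process(String in) {
        return in;
      }
    }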

    3 votes · 0 comments
  10. Ability to run a Dataflow pipeline from a deployed App Engine flex Python service without gcloud

    I have defined a batch Python pipeline inside an App Engine flex Python service.

    Once deployed on GCP, the pipeline cannot be compiled and started.
    The current workaround is to install the gcloud SDK in the service's Dockerfile.

    It would be great not to have to install the gcloud SDK on the deployed service.
    It would also be great to have documentation on best practices for running a Python pipeline from a deployed service.

    3 votes · 1 comment
  11. 3 votes · 0 comments
  12. Error reporting via Stackdriver using logging appenders

    Hi, my idea is to add labels to Dataflow errors. I am trying to add more info to the exceptions in a Dataflow step using slf4j and logback, and I have updated the logger errors to include marker text so they are easy to identify in GCP Stackdriver. I have done the following steps:

    Added logback.xml to src/main/resources (on the classpath).
    Created a LoggingEventEnhancer and an enhancer class to add new labels.
    Added markers to the logger errors, to find the type of error in Stackdriver.

    But the logs in Stackdriver don't have the new labels (or markers) added via the logging appender. I think logback.xml is not being picked up by Maven.

    3 votes · 0 comments
  13. Dataflow templates broken with Python 3

    Beam fails to stage a Dataflow template with Python 3. It looks like Beam is trying to access the RuntimeValueProvider during staging, causing a 'not accessible' error.

    The template stages fine with Python 2.7.

    Repo with code to reproduce the issue and stack trace: https://github.com/firemuzzy/dataflow-templates-bug-python3

    3 votes · 0 comments
  14. Iterative parallel processing

    Some more complex flows require one transformation to be applied repeatedly until a certain condition is met (for example, traversing a tree). Currently Dataflow does not allow doing that in a parallel way. A workaround sketch is shown below.
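
    A rough workaround sketch, under the assumption that a maximum iteration depth is known up front (`expand`, `input`, and `MAX_DEPTH` are placeholders): unroll the loop at pipeline-construction time. A data-dependent stopping condition is exactly what native iteration support would add.

    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Apply the same transformation a fixed number of times by unrolling the loop
    // while the pipeline graph is being built.
    PCollection<String> current = input;
    for (int i = 0; i < MAX_DEPTH; i++) {
      current = current.apply("Expand-" + i,
          MapElements.into(TypeDescriptors.strings())
              .via((String node) -> expand(node)));
    }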

    3 votes · 0 comments
  15. Dataflow parameters are invalid

    Quite often when deploying Google Cloud Dataflow jobs I get this error:

    Error: googleapi: Error 400: The template parameters are invalid., badRequest

    Discovering the root cause of the error is difficult because, as far as I can tell, no other information about the error is available. Possible causes are:

    - a non-optional parameter hasn't had a value supplied for it
    - a spelling error in my deployment code (which happens to be Terraform, not that that is important)
    - specifying values for parameters that are not declared in my flex template specification file

    These have all caused me to receive the error message… (see the metadata sketch below).
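
    As a sketch of that last point (the field names follow the template metadata format as I understand it; the parameter itself is hypothetical), declaring every parameter in the flex template metadata file at least makes the mismatch cases easier to spot:

    {
      "name": "my-flex-template",
      "description": "Illustrative metadata file",
      "parameters": [
        {
          "name": "inputSubscription",
          "label": "Input Pub/Sub subscription",
          "helpText": "Subscription to read from.",
          "isOptional": false
        }
      ]
    }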

    3 votes · 0 comments
  16. Dataprep: Drop columns by a specific rule

    Today, in order to remove columns in Dataprep, I need to remove each column manually. When dealing with raw datasets I often find myself manually deleting hundreds of columns due to lack of value (e.g. 90% empty). I want to be able to define a rule that lets me delete all the columns matching a specific logical expression (e.g. delete all the columns that have more than 90% empty values).

    2 votes · 0 comments
  17. Pin products on console drawer by project

    As of today, when you pin a product (e.g. Google BigQuery) on the console drawer, the product is pinned independently of the selected Google Cloud project. Since different projects may use different sets of products, it would be nice if the pinned products were scoped by project.

    2 votes · 0 comments
  18. JavaScript

    It would be great to add JavaScript as a compute language, in addition to Python and Java.

    2 votes · 0 comments
  19. 2 votes · 0 comments
  20. Allow custom logger appenders

    Using a custom log appender (e.g. with logback) inside Dataflow is impossible at the moment. Any logging settings I provide seem to be superseded by Google's own appender, and the logs just show up in the Dataflow logs in Stackdriver. I want to send my logs to an Elasticsearch cluster, since the rest of my logs, generated by other non-Dataflow systems, are there as well.

    1 vote · 0 comments