Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions; for product support, please use the help forums instead.

We can’t wait to hear from you!

  1. Ability to run a Dataflow pipeline from a deployed App Engine flex Python service without gcloud

    I have defined a batch Python pipeline inside an App Engine flex Python service.

    Once the service is deployed on GCP, the pipeline cannot be built and launched.
    The current workaround is to install the gcloud SDK in the service's Dockerfile.

    It would be great not to have to install the gcloud SDK on the deployed service,
    and to have documentation on best practices for running a Python pipeline from a deployed service.
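
    A minimal sketch of gcloud-free submission from Python, assuming the service account attached to the service already has the required Dataflow roles (project, region, and bucket names are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def launch_job():
        options = PipelineOptions(
            runner="DataflowRunner",
            project="my-project",                 # placeholder
            region="us-central1",                 # placeholder
            temp_location="gs://my-bucket/temp",  # placeholder
        )
        p = beam.Pipeline(options=options)
        (p
         | "Create" >> beam.Create([1, 2, 3])
         | "Log" >> beam.Map(print))
        # run() submits the job to the Dataflow service directly,
        # without shelling out to gcloud.
        p.run()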

    3 votes  ·  1 comment
  2. Beam SDK for Python: support for Triggers

    The Beam Python SDK doesn't support triggers (or state, or timers).
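
    For context, this is roughly the shape such support could take in Python, sketched after the triggering model the Java SDK already has (the events collection and window sizes are illustrative):

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms import trigger

    windowed = (
        events  # an unbounded PCollection (illustrative)
        | beam.WindowInto(
            window.FixedWindows(60),  # 60-second event-time windows
            # Fire early every 30 seconds of processing time,
            # then fire when the watermark passes the end of the window.
            trigger=trigger.AfterWatermark(
                early=trigger.AfterProcessingTime(30)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING))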

    3 votes
  3. Log BigQuery Error

    Currently, when BigQueryIO.Write tries to insert something invalid in streaming mode, the log contains only the call stack, not the reason for the error (such as a wrong format or a wrong field name):
    exception: "java.lang.IllegalArgumentException: timeout value is negative

    at java.lang.Thread.sleep(Native Method)
    at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
    at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
    at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
    at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
    at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:696)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:94)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:521)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    "

    logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"

    stage: "F17"

    job: "2016-10-061514_44-8205894049986167886"
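
    For comparison, the underlying BigQuery streaming-insert API does return per-row error reasons that could be surfaced; a minimal sketch with the google-cloud-bigquery Python client (the table name is a placeholder):

    from google.cloud import bigquery

    client = bigquery.Client()
    rows = [{"name": "alice", "age": "not-a-number"}]  # deliberately bad field
    errors = client.insert_rows_json("my-project.my_dataset.my_table", rows)
    for entry in errors:
        # Each entry carries the row index and a list of reasons,
        # e.g. {'index': 0, 'errors': [{'reason': 'invalid', 'message': ...}]}
        print(entry)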

    3 votes
  5. Ability to re-run a failed / stopped / canceled job

    It would be useful to be able to re-run a failed, stopped, or canceled job.

    3 votes
  6. Dataflow templates broken with python 3

    Beam fails to stage a Dataflow template with Python 3. It looks like Beam tries to access the RuntimeValueProvider during staging, causing a 'not accessible' error.

    The template stages fine with Python 2.7.

    Repo with code to reproduce the issue and stack trace: https://github.com/firemuzzy/dataflow-templates-bug-python3
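
    For context, template parameters are deferred ValueProviders that can only be resolved at run time; a minimal sketch of the pattern involved (the option name is hypothetical):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class TemplateOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # A deferred (template) parameter, resolved only at run time.
            parser.add_value_provider_argument('--input_path', type=str)

    options = TemplateOptions()
    p = beam.Pipeline(options=options)
    # Fine at staging time: the ValueProvider object itself is passed along.
    lines = p | beam.io.ReadFromText(options.input_path)
    # Not fine at staging time: options.input_path.get() may only be called
    # from a runtime context, which is what the 'not accessible' error is about.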

    3 votes
  7. Show avg/median per-element processing time in monitoring interface

    When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.

    It would be nice to be able to see the per-element processing time: either a simple average or, better yet, a histogram. This would make performance monitoring much easier.
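
    In the meantime, a user-defined metric can approximate this; a minimal sketch using Beam's Metrics API, whose distribution metrics (min/max/mean/count) are surfaced in the monitoring UI (metric names are placeholders):

    import time
    import apache_beam as beam
    from apache_beam.metrics import Metrics

    class TimedDoFn(beam.DoFn):
        def __init__(self):
            # A distribution metric: Dataflow reports min/max/mean/count.
            self.latency_ms = Metrics.distribution('my_pipeline', 'per_element_ms')

        def process(self, element):
            start = time.monotonic()
            result = element  # ... the actual per-element work goes here ...
            self.latency_ms.update(int((time.monotonic() - start) * 1000))
            yield result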

    3 votes
  8. Support MapState from Beam API

    Currently the DataflowRunner does not support MapState for stateful DoFns.

    3 votes
  9. gradle

    I would love an example Gradle project. Gradle is hugely popular in the Java community, so it is odd that there is only documentation on how to use Dataflow with Maven.
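
    As a starting point, a minimal build.gradle sketch for a Dataflow pipeline (not an official example; the Beam version and main class are placeholders):

    plugins {
        id 'java'
        id 'application'
    }

    repositories {
        mavenCentral()
    }

    dependencies {
        // Substitute a real Beam release for the placeholder version.
        implementation 'org.apache.beam:beam-sdks-java-core:2.XX.0'
        implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.XX.0'
    }

    application {
        mainClass = 'com.example.MyPipeline'  // hypothetical entry point
    }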

    3 votes
  10. Display execution parameters on Job info screen

    On the main info screen for a particular job, a tab for execution parameters would be very useful for debugging and quantifying job performance.

    Pretty much the whole suite of:

    --input
    --stagingLocation
    --dataflowJobFile
    --maxNumWorkers
    --numWorkers
    --diskSizeGb
    --jobName
    --autoscalingAlgorithm

    that Dataflow supports as execution parameters would be great to have to the right of "Step", on a tab called "Job".
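
    For reference, these are the options as they would appear on a typical pipeline invocation (the jar name and values are illustrative):

    java -jar my-pipeline.jar \
        --runner=DataflowRunner \
        --jobName=my-job \
        --input=gs://my-bucket/input/* \
        --stagingLocation=gs://my-bucket/staging \
        --numWorkers=5 \
        --maxNumWorkers=20 \
        --diskSizeGb=100 \
        --autoscalingAlgorithm=THROUGHPUT_BASED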

    3 votes
  12. Pin products on console drawer by project

    As of today, when you pin a product (e.g. Google BigQuery) on the console drawer, it is pinned independently of the selected Google project. Since different projects may use different sets of products, it would be nice if the pinned products were scoped by project.

    2 votes
  13. Ability to select an existing job and create a new one from this job (like custom job template)

    It would be useful to be able to select an existing job and create a new one from it, like a custom job template.

    2 votes
  14. JavaScript

    It would be great to add JavaScript as a supported pipeline language in addition to Python and Java.

    2 votes
  15. Dataprep: drop columns by a specific rule

    Today, in order to remove columns in Dataprep, I have to delete each column manually. When dealing with raw datasets I often find myself deleting hundreds of columns by hand because they carry little value (e.g. 90% empty). I want to be able to define a rule that deletes all columns matching a logical expression (e.g. delete all columns that are more than 90% empty).
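
    To illustrate the kind of rule being requested, here is the equivalent expressed in pandas (illustration only; Dataprep would need its own rule syntax):

    import pandas as pd

    df = pd.read_csv('raw_dataset.csv')  # placeholder input
    # Keep only the columns that are at most 90% empty.
    df = df.loc[:, df.isna().mean() <= 0.9]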

    2 votes
  16. Error reporting via Stackdriver using logging appenders

    Hi, my idea is to add labels to Dataflow errors. I am trying to add more info to the exceptions in a Dataflow step using slf4j and logback. I have updated logger errors to include marker text so they are easy to identify in GCP Stackdriver. I have done the following steps:

    1. Added logback.xml to src/main/resources (on the classpath).
    2. Created a LoggingEventEnhancer and an enhancer class to add new labels.
    3. Added markers to logger errors, to find the type of error in Stackdriver.

    But the logs in Stackdriver don't have the new labels (or markers) added via the logging appender. I think logback.xml is not being picked up by Maven.
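
    For reference, the enhancer wiring in question looks roughly like this in logback.xml (a sketch assuming the google-cloud-logging-logback appender; the enhancer class name is hypothetical):

    <configuration>
      <appender name="CLOUD" class="com.google.cloud.logging.logback.LoggingAppender">
        <log>application.log</log>
        <!-- Hypothetical enhancer class that adds custom labels -->
        <loggingEventEnhancer>com.example.MyLoggingEventEnhancer</loggingEventEnhancer>
      </appender>
      <root level="INFO">
        <appender-ref ref="CLOUD" />
      </root>
    </configuration>
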
    1 vote
  18. Codegen

    I would like to be able to take a valid job and generate at least some of the code needed to re-create it in Python using the client API. I want to be able to do this for my historic jobs as well.
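
    A possible starting point: a job's definition can already be read back through the Dataflow REST API; a minimal sketch with the Google API Python client (the IDs are placeholders):

    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')
    job = dataflow.projects().locations().jobs().get(
        projectId='my-project',     # placeholder
        location='us-central1',     # placeholder
        jobId='2019-01-01_00_00_00-1234567890123456789',  # placeholder
        view='JOB_VIEW_ALL',
    ).execute()
    # The returned job includes the pipeline's steps and options,
    # which is the information codegen would have to start from.
    print(job['name'], len(job.get('steps', [])))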

    1 vote
  19. Add numWorkers and autoscalingAlgorithm parameters to Google-provided templates?

    I'm using the Google-provided "Datastore to GCS Text" template, but under high load (~8M entities) Dataflow takes too long to scale. If we could provide numWorkers and autoscalingAlgorithm parameters, jobs would take less time to execute.
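
    For context, the templates launch API already accepts an environment block through which such settings could be passed; a minimal sketch with the Google API Python client (names and values are placeholders):

    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')
    response = dataflow.projects().locations().templates().launch(
        projectId='my-project',      # placeholder
        location='us-central1',      # placeholder
        gcsPath='gs://dataflow-templates/latest/Datastore_to_GCS_Text',
        body={
            'jobName': 'datastore-export',
            'parameters': {},  # template-specific parameters go here
            # RuntimeEnvironment supports worker counts, among other settings.
            'environment': {'numWorkers': 10, 'maxWorkers': 50},
        },
    ).execute()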

    1 vote
  20. BUG: google-cloud-firestore not working with apache_beam[gcp] 2.16.0, but works with 2.13.0

    Running

    python pipeline --runner DataflowRunner ... --requirements_file requirements.txt

    throws an error with

    apache_beam[gcp]==2.16.0
    google-cloud-firestore

    in requirements.txt, but not with

    apache_beam[gcp]==2.13.0
    google-cloud-firestore

    Part of the error:

    ModuleNotFoundError: No module named 'Cython'\n \n ----------------------------------------\nCommand "python setup.py egg_info" failed with error code 1 in /private/var/folders/h4/n9rzy8z52lqdh7sfkhr96nnw0000gn/T/pip-download-lx28dwpv/pyarrow/\n'

    See also: https://stackoverflow.com/questions/57286517/importing-google-firestore-python-client-in-apache-beam

    Where to report bugs?

    1 vote