Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions; for help and support questions, please use the Cloud Dataflow help forums instead.

We can’t wait to hear from you!

  1. Log BigQuery Error

    Currently, when BigQueryIO.Write tries to insert a malformed row in streaming mode, the log contains only the call stack, not the reason for the error (such as a wrong format or a wrong field name):
    exception: "java.lang.IllegalArgumentException: timeout value is negative

    at java.lang.Thread.sleep(Native Method)
    
    at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
    at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
    at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
    at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
    at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:696)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:94)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:521)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    "

    logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"

    stage: "F17"

    job: "2016-10-061514_44-8205894049986167886"

    3 votes
  2. 3 votes
  3. Dataflow templates broken with python 3

    Beam fails to stage a Dataflow template with Python 3. It looks like Beam tries to access the RuntimeValueProvider during staging, causing a 'not accessible' error.

    The template stages fine with Python 2.7.

    Repo with code to reproduce the issue and stack trace: https://github.com/firemuzzy/dataflow-templates-bug-python3

    3 votes
  4. Set Proxy Host

    When trying to run a Dataflow driver program behind a firewall, it needs to use a proxy to connect to GCP, but there does not seem to be a way to specify that HTTPS traffic should go through a proxy.
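    Until Dataflow exposes a first-class option, one workaround for Java driver programs (an assumption on my part, not an official Dataflow feature) is the JVM's standard proxy system properties, which the underlying Google HTTP client generally honors. Host, port, and jar name below are hypothetical:

```shell
# Route the driver's HTTPS traffic through a corporate proxy
# using standard JVM proxy properties (hypothetical host/port).
java -Dhttps.proxyHost=proxy.example.com \
     -Dhttps.proxyPort=3128 \
     -jar my-pipeline.jar --runner=DataflowRunner ...
```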

    3 votes
  5. Show avg/median per-element processing time in monitoring interface

    When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.

    It would be nice to be able to see the per-element processing time: either a simple average or, better yet, a histogram. This would make performance monitoring much easier.
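    In the meantime, the average can be derived by hand from the two numbers the interface already shows; a minimal sketch (the function name is mine):

```python
def avg_ms_per_element(total_execution_ms, element_count):
    """Average per-element processing time, computed from the element
    count and total execution time the monitoring UI already exposes."""
    if element_count == 0:
        return 0.0
    return total_execution_ms / element_count

# e.g. 120,000 ms over 60,000 elements -> 2.0 ms per element
print(avg_ms_per_element(120_000, 60_000))
```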

    3 votes
  6. Support MapState from Beam API

    Currently, the DataflowRunner does not support MapState for stateful DoFns.

    3 votes
  7. gradle

    I would love an example Gradle project. Gradle is very popular in the Java community, so it is odd that there is only documentation on how to use Dataflow with Maven.
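    For reference, a minimal build.gradle sketch for a Beam/Dataflow pipeline. The artifact coordinates are the real Apache Beam ones; the version number and main class are illustrative, not a recommendation:

```groovy
// Minimal Gradle build for a Dataflow pipeline (illustrative versions).
plugins {
    id 'java'
    id 'application'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'org.apache.beam:beam-sdks-java-core:2.16.0'
    implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.16.0'
}

mainClassName = 'com.example.MyPipeline'
```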

    3 votes
  8. Display execution parameters on Job info screen

    On the main info screen for a particular job, a tab for execution parameters would be very useful for debugging and quantifying job performance.

    Pretty much the whole suite of:

    --input
    --stagingLocation
    --dataflowJobFile
    --maxNumWorkers
    --numWorkers
    --diskSizeGb
    --jobName
    --autoscalingAlgorithm

    that dataflow supports as execution parameters would be great to have to the right of "Step" on a tab called "Job".

    3 votes
  9. 2 votes
  10. Pin products on console drawer by project

    Today, when you pin a product (e.g. Google BigQuery) on the console drawer, it is pinned independently of the selected Google Cloud project. Since different projects may use different sets of products, it would be nice if pinned products were scoped by project.

    2 votes
  11. Javascript

    It would be great to add JavaScript as a supported pipeline language in addition to Python and Java.

    2 votes
  12. 1 vote
  13. Codegen

    I would like to take a valid job and generate at least some of the Python code needed to re-create it using the client API. I want to be able to do this for my historic jobs as well.

    1 vote
  14. Add numWorkers and autoscalingAlgorithm parameters to Google-provided templates

    I'm using "Datastore to GCS text" provided Template but with high load (~8M entities) dataflow takes too long to scale. If we can provide numWorkers and autoscalingAlgorithm parameters jobs will take less time to execute.

    1 vote
  15. BUG: google-cloud-firestore not working with apache_beam[gcp] 2.16.0, but works with 2.13.0

    Running

    python pipeline --runner DataflowRunner ... --requirements_file requirements.txt

    throws an error when requirements.txt contains

    apache_beam[gcp]==2.16.0
    google-cloud-firestore

    but not when it contains

    apache_beam[gcp]==2.13.0
    google-cloud-firestore

    Part of error:

    ModuleNotFoundError: No module named 'Cython'\n \n ----------------------------------------\nCommand "python setup.py egg_info" failed with error code 1 in /private/var/folders/h4/n9rzy8z52lqdh7sfkhr96nnw0000gn/T/pip-download-lx28dwpv/pyarrow/\n'

    See also: https://stackoverflow.com/questions/57286517/importing-google-firestore-python-client-in-apache-beam
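    The traceback suggests pip is building pyarrow from source without Cython available; one possible, unverified workaround is to list the build dependency explicitly at the top of requirements.txt:

```
Cython                      # possible workaround: make pyarrow's build dependency explicit
apache_beam[gcp]==2.16.0
google-cloud-firestore
```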

    Where to report bugs?

    1 vote
  16. Dump data into a table by reading a file from a bucket, then run 200 queries on it

    I need to read a file from a bucket, dump that file's data into a BigQuery table, and then run 200 queries against the dumped table. The whole process must run one step at a time, but because the work runs in parallel it does not complete correctly, so I need to synchronize it such that one job finishes before the next is triggered.
    Can anyone help me?
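    One way to get strict ordering is to drive the steps from a script rather than a single parallel pipeline: the `bq` CLI runs jobs synchronously by default, so each command blocks until its job finishes. Dataset, table, bucket, and file names below are hypothetical:

```shell
# Load the file into BigQuery, then run the queries one by one;
# each bq command blocks until its job completes.
bq load --source_format=CSV mydataset.mytable gs://my-bucket/my-file.csv
for i in $(seq 1 200); do
    bq query --use_legacy_sql=false < "query_${i}.sql"
done
```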

    1 vote
  17. Does your worker machine type affect scheduling?

    Please update the docs to describe how machine type affects jobs.

    If you have a serial pipeline and don't do any native threading in your DoFn, is an n1-standard-8 going to be any faster than an n1-standard-1?

    If you have parallel stages and set a max of 50 workers, will work get done faster on n1-standard-8s than on n1-standard-1s, i.e. will Dataflow use 400 cores for workers instead of 50?

    [please ignore that n1-standard-8 has more ram and may help groupBy for this discussion]
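    The second question comes down to back-of-envelope arithmetic, under the assumption (which the docs should confirm or deny) that the worker harness runs roughly one work-item thread per vCPU:

```python
def potential_parallel_threads(num_workers, vcpus_per_worker):
    # Assumption to be confirmed by the docs: roughly one
    # work-item thread per vCPU on each worker VM.
    return num_workers * vcpus_per_worker

print(potential_parallel_threads(50, 1))  # 50 x n1-standard-1 -> 50
print(potential_parallel_threads(50, 8))  # 50 x n1-standard-8 -> 400
```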

    1 vote
  18. Update the screenshots to match the current console

    The screenshots are old and outdated, so they cannot be used for troubleshooting; for example, Permissions is no longer on the left.

    The permission error messages fail to say which account is having access issues, so I have spent hours trying random combinations and still have not gotten even the most basic Dataflow examples to work.

    1 vote
  19. Painful and sometimes impossible to change pipeline options using Template

    Not all parameters can be configured at runtime; the docs should make this clear to users.

    For example, AWS credentials/regions are configured at template construction time. There is no way (at least that I found) to change them at run time, e.g. to pass in credentials when launching the template.

    Another case: in order to accept dynamic BigQuery queries at run time, extra code is needed to override the default query.
    What confuses me is that a query/table has to be provided at template construction time, otherwise BigQueryIO complains.

    1 vote
  20. I really need Python examples.

    How to read a file from GCS and load it into BigQuery, with Python.
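    To make the request concrete, here is a minimal sketch of the usual shape of such a pipeline. The parsing function is plain Python; the Beam wiring, which requires apache-beam[gcp] and real bucket/table names (all hypothetical below), is shown in comments:

```python
def parse_csv_line(line):
    """Turn one 'name,age' CSV line into a dict matching the BigQuery schema."""
    name, age = line.split(",")
    return {"name": name, "age": int(age)}

# Beam wiring (requires apache-beam[gcp]; names are hypothetical):
#
# import apache_beam as beam
# with beam.Pipeline(options=pipeline_options) as p:
#     (p
#      | beam.io.ReadFromText("gs://my-bucket/people.csv")
#      | beam.Map(parse_csv_line)
#      | beam.io.WriteToBigQuery(
#            "my-project:my_dataset.people",
#            schema="name:STRING,age:INTEGER"))
```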

    1 vote