Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help forums, look here:

We can’t wait to hear from you!

  1. dataprep/Causing a job to fail if there is an error in the recipe

    TL;DR - I want to be notified when a recipe is no longer valid due to errors in the data source or recipe (which can happen after a change in the source tables or for any other reason), and to see that the transformation has failed.

    Example scenario - I have a dataset (e.g. a Salesforce table) that is extracted daily and saved to GCS. A scheduled Dataprep job is invoked every day on the new table and transforms it with a pre-defined recipe. When a column is added to or removed from the data source, the recipe might…

    10 votes  ·  0 comments
  2. Scala-friendly API / library

    The current darkjh/scalaflow library is pretty basic, and the DoFn handling is messy. It would be nice if Scala were natively supported.

    8 votes  ·  1 comment
  3. API for Go

    It would be nice if Go were natively supported.

    6 votes  ·  0 comments
  4. Add the ability to sort jobs by status (e.g. running vs closed)

    I would like to be able to quickly see the number of jobs that are currently running. Sometimes streaming jobs that have been running for weeks get buried below batch or testing jobs.
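
    In the meantime, a possible workaround is to query the Dataflow REST API for active jobs only. Below is a minimal Python sketch using the Google API client; the project ID, region, and helper function name are placeholders of ours, not part of any SDK.

    # Workaround sketch: list only currently running (ACTIVE) Dataflow jobs via the
    # v1b3 REST API. Requires google-api-python-client and application default
    # credentials; the project and region below are placeholders.
    from googleapiclient.discovery import build

    def list_active_jobs(project_id='my-project', region='us-central1'):
        dataflow = build('dataflow', 'v1b3')
        request = dataflow.projects().locations().jobs().list(
            projectId=project_id,
            location=region,
            filter='ACTIVE',  # only jobs that are currently running
        )
        response = request.execute()
        return response.get('jobs', [])

    for job in list_active_jobs():
        print(job['name'], job['currentState'])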

    6 votes  ·  0 comments
  5. Is there a plan for a Go SDK for Dataflow? It seems Apache Beam has a Go SDK now.

    I want to build a pipeline using the Go SDK. Java is slow for my application, and I don't want to waste money on extra CPUs to pay for the slowness of Java/Python. Now that Apache Beam supports a Go SDK, is there a plan to support it in Dataflow?

    6 votes  ·  2 comments
  6. Log GCE worker setup process and exception

    We had trouble running even a simple word count job with the DataflowRunner.

    The job was stuck for an hour, but the logs said nothing.

    ==========================================
    INFO:root:2017-08-28T07:26:07.709Z: JOB_MESSAGE_BASIC: (5296cce74062ca91): Starting 1 workers in asia-northeast1-c...
    INFO:root:2017-08-28T07:26:07.731Z: JOB_MESSAGE_DEBUG: (f4c12c649707e205): Value "group/Session" materialized.
    INFO:root:2017-08-28T07:26:07.745Z: JOB_MESSAGE_BASIC: (f4c12c649707e994): Executing operation read/Read+split+pair_with_one+group/Reify+group/Write

    **** Stuck for 1 hour, so we cancelled the job ****
    ==========================================

    The reason was that we needed to set the [network] and [subnetwork] options explicitly and correctly.
    After that, the job worked.

    It would be very helpful to know what is happening, or what is blocking the job, while the workers are being set up (quota? wrong network? etc.).
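
    For reference, here is a minimal sketch (Python SDK, to match the log output above) of setting the network and subnetwork options explicitly when submitting the job; the project, bucket, and network names are placeholders, and the pipeline itself is only a trivial example.

    # Minimal sketch: pass --network and --subnetwork explicitly so the Dataflow
    # workers can start. All resource names below are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                # placeholder
        '--region=asia-northeast1',            # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
        '--network=my-network',                # VPC network the workers should join
        '--subnetwork=regions/asia-northeast1/subnetworks/my-subnet',  # placeholder
    ])

    # Trivial pipeline; the point is the worker network options above.
    with beam.Pipeline(options=options) as p:
        (p
         | 'read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')   # placeholder
         | 'count' >> beam.combiners.Count.Globally()
         | 'write' >> beam.io.WriteToText('gs://my-bucket/counts'))     # placeholder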

    6 votes  ·  0 comments
  7. Improve cost effectiveness of autoscaling algorithm

    I'm currently processing log data from multiple days with Cloud Dataflow. According to the defined options it uses 10 to 100 workers and the throughput-based autoscaling algorithm. At the moment there are still 64 workers active, while only one job is still running at around 1500 elements per second. If you look at the workers' CPU graph, you can see that almost all of them have been idle for the last 30 minutes. I would prefer autoscaling I don't have to think about, where I know I always get optimal cost effectiveness.

    5 votes  ·  1 comment

    Hi Max,

    We’ve done a few performance optimizations lately that should result in a much improved experience. Could you share a jobID for us to take a look at? (I’m curious to examine the experience you describe).

    Thanks!

  8. Wait.on() PDone instead of PCollection

    I would like to be able to publish an event to Pub/Sub after writing data to BigQuery. The Wait.on() transform is intended for exactly this situation; however, Wait.on() requires a PCollection as the input to wait on, while BigQueryIO returns a PDone. As such, I would like to be able to Wait.on() a PDone before applying a subsequent transform.

    4 votes  ·  0 comments
  9. Pub/Sub to Cloud SQL template

    There are lots of useful templates. One that would be useful to me is Pub/Sub to Cloud SQL.

    4 votes  ·  0 comments
  10. Is there a way to get notified whenever there is a new Dataflow release?

    Can we subscribe to a mailing list and get notified whenever there is a new Dataflow release?

    4 votes  ·  0 comments
  11. Automated FTP

    I would like a way to create a Read transform that can be scheduled to upload an FTP payload to Google Cloud Storage for further processing.

    4 votes  ·  0 comments
  12. Stop a streaming pipeline when idle to save costs

    Reading from Pub/Sub for a limited amount of time was possible in SDK 1.9.1 with maxReadTime(Duration), but in 2.x we don't have that option anymore.

    I know that using Pub/Sub as a bounded collection is bad design, but sometimes only a few messages are pushed once a day, and it is impractical to keep a streaming job running in those cases.

    Like Dataproc's scheduled deletion (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion), it would be nice to have something similar in Dataflow to save money.

    4 votes  ·  0 comments
  13. Link directly to logs

    It would be nice to have a direct link to the logs of the job from the overview page.

    At the moment, you have to:

    - Click the job
    - Wait for it to load
    - Click "Logs"
    - Click "Worker Logs"
    - Wait for it to load

    It should be just one click to get the logs :-)

    4 votes  ·  0 comments
    planned  ·  Rafael Fernandez responded

    Thanks for the feedback!

    We are working on improving this experience along these lines. We'll be happy to discuss more details next time we sync.

  14. Add a more helpful error message to "projects.jobs.create"

    I'm trying to launch a batch pipeline from outside the project, and the "projects.jobs.create" API is returning:

    {
      "error": {
        "code": 400,
        "message": "Request contains an invalid argument.",
        "status": "INVALID_ARGUMENT"
      }
    }

    There is no indication of which argument is invalid. This wouldn't be such a big deal, except that the documentation for this method (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/create?authuser=1) does not indicate which fields are required. :(

    3 votes  ·  0 comments
  15. Iterative parallel processing

    Some more complex flows require one transformation to be applied multiple times until a certain condition is met (like traversing a tree). Currently Dataflow does not allow doing that in a parallel way.

    3 votes  ·  0 comments
  16. Meaningful information about step output collections in the UI

    In the UI, when clicking on a step, showing the output collections' tag names or output tag IDs when available, instead of "out" + index, would be more meaningful.

    3 votes  ·  0 comments
  17. Ability to run a Dataflow pipeline from a deployed flex Python App Engine service without gcloud

    I have defined a batch Python pipeline inside a flex App Engine Python service.

    Once deployed on GCP, the pipeline cannot be compiled and started.
    The current workaround is to install the gcloud SDK in the Dockerfile of the service.

    It would be great not to have to install the gcloud SDK on the deployed service.
    It would also be great to have documentation about best practices for running a Python pipeline from a deployed service.
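
    For reference, the usual way to submit a pipeline programmatically (in principle needing only the apache-beam[gcp] package and application default credentials, not the gcloud SDK) looks roughly like the sketch below; the project, bucket, path names, and function name are placeholders.

    # Rough sketch, assuming apache-beam[gcp] is installed in the service image and
    # application default credentials are available. All names are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def launch_pipeline():
        options = PipelineOptions([
            '--runner=DataflowRunner',
            '--project=my-project',                       # placeholder
            '--region=us-central1',                       # placeholder
            '--temp_location=gs://my-bucket/tmp',         # placeholder
            '--staging_location=gs://my-bucket/staging',  # placeholder
        ])
        p = beam.Pipeline(options=options)
        (p
         | 'read' >> beam.io.ReadFromText('gs://my-bucket/input/*.csv')     # placeholder
         | 'write' >> beam.io.WriteToText('gs://my-bucket/output/result'))  # placeholder
        # Submits the job to the Dataflow service; does not block on completion.
        p.run()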

    3 votes  ·  1 comment
  18. Beam SDK for Python: support for Triggers

    The Beam Python SDK doesn't support triggers (or state, or timers).

    3 votes  ·  0 comments
  19. Log BigQuery Error

    Currently, when BigQueryIO.Write tries to insert something invalid in streaming mode, the log contains only the call stack, not the reason for the error (like a wrong format or a wrong field name):
    exception: "java.lang.IllegalArgumentException: timeout value is negative
    at java.lang.Thread.sleep(Native Method)
    at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
    at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
    at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
    at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
    at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
    at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:696)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:94)
    at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:521)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    "
    logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"
    stage: "F17"
    job: "2016-10-06_15_14_44-8205894049986167886"

    3 votes  ·  0 comments