Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you’re looking for help forums, look here:

We can’t wait to hear from you!

How can we improve Cloud Dataflow?

Enter your idea and we'll search to see if someone has already suggested it.

If a similar idea already exists, you can support and comment on it.

If it doesn't exist, you can post your idea so others can support it.

  1. Patch Apache Beam's XmlSource to allow templating in Dataflow

    Background

    In our project, a Cloud Function starts a Dataflow pipeline in batch mode to upload data to ElasticSearch. The pipeline's source is an XML file.

    The Dataflow template is used to upload the data into GCP.

    Problem

    Creating a template requires pipeline options that can accept values at runtime. This is done by wrapping the option type in the ValueProvider interface.

    XmlSource, the class for reading XML sources, did not use ValueProvider for its option parameters. We solved this by patching XmlSource.

    Upload of dataflow template should be…
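
The deferral that ValueProvider provides can be illustrated with a small standalone model. This is only a sketch of the concept in plain Python: the names `StaticValue`, `RuntimeValue`, and `describe_source` are hypothetical and are not Beam's API or the patch described in this idea.

```python
from typing import Dict, Generic, TypeVar

T = TypeVar("T")


class StaticValue(Generic[T]):
    """A value that is already known when the pipeline graph is built."""

    def __init__(self, value: T) -> None:
        self._value = value

    def is_accessible(self) -> bool:
        return True

    def get(self) -> T:
        return self._value


class RuntimeValue(Generic[T]):
    """A value that is only supplied when a templated job is launched."""

    def __init__(self, name: str, runtime_options: Dict[str, T]) -> None:
        self._name = name
        # Shared dict that is assumed to be filled in at launch time.
        self._runtime_options = runtime_options

    def is_accessible(self) -> bool:
        return self._name in self._runtime_options

    def get(self) -> T:
        if not self.is_accessible():
            raise RuntimeError(f"option {self._name!r} is not available yet")
        return self._runtime_options[self._name]


def describe_source(file_pattern) -> str:
    """Hypothetical source: reads the option lazily, not at construction."""
    if file_pattern.is_accessible():
        return f"reading {file_pattern.get()}"
    return "file pattern deferred until launch"
```

A source that calls `get()` only while it is executing can be built into a template before the input path is known, which is the property the idea asks XmlSource to have.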

    26 votes
    0 comments
    • 22 votes
      under review  ·  3 comments
      • Scheduling Dataflow pipeline code as a job in the cloud in Java

        How can we schedule Dataflow pipeline code to run as a job in the cloud using Java?

        20 votes
        0 comments
        • Ability to load data to BigQuery Partitions from dataflow pipelines (Python)

          Python Dataflow pipelines fail in the _parse_table_reference function when you specify a BigQuery table name with a partition decorator as the load destination. Support for this is essential if you want to take advantage of BigQuery table partitioning.
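
To make the request concrete, here is a minimal sketch of a table-spec parser that accepts an optional partition decorator such as `$20170101`. It is not the SDK's `_parse_table_reference`; the function name `parse_table_spec`, the `TableRef` type, and the exact pattern are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    project: Optional[str]
    dataset: str
    table: str
    partition: Optional[str]  # e.g. "20170101", or None without a decorator


# Assumed shape: [project:]dataset.table[$YYYYMMDD]
_TABLE_RE = re.compile(
    r"^(?:(?P<project>[\w\-.]+):)?"
    r"(?P<dataset>\w+)\."
    r"(?P<table>\w+)"
    r"(?:\$(?P<partition>\d{8}))?$"
)


def parse_table_spec(spec: str) -> TableRef:
    """Split a table spec into its parts, keeping the partition decorator."""
    match = _TABLE_RE.match(spec)
    if match is None:
        raise ValueError(f"not a valid table spec: {spec!r}")
    return TableRef(**match.groupdict())
```

With this sketch, `parse_table_spec("logs.events$20170101")` keeps the partition as a separate field instead of rejecting the name.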

          13 votes
          0 comments
          • Show Total Memory and CPU Usage Alongside Worker Graph

            It would be great to see memory and CPU consumption directly alongside the worker graph. This would make it much easier to correlate and debug different stages, and would give some insight into which machine types perform best. It is possible to do this now by creating metrics in Stackdriver, but that is very involved when a simple graph could do the trick, just like in Kubernetes/Google Container Engine.

            7 votes
            1 comment
            • Ability to use our own kubernetes cluster as dataflow runner

              It would be good to be able to use our own container cluster to run Dataflow workers, since Dataflow already uses Kubernetes to deploy workers. This could even take the user-supplied cluster's current workload into account and balance workers between the user-provided cluster and the Dataflow cluster.

              6 votes
              0 comments
              • Show cost of current job according to the new pricing structure.

                It would be good to see the cost of the job in the job view. I even thought of writing a Chrome extension for this, because it's pretty trivial given data such as vCPU sec, RAM MB sec, PD MB sec, etc.
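
The arithmetic behind such an estimate could look roughly like the sketch below. The `Rates` fields are placeholders, not actual prices, and the conversion of MB to GB by 1024 is an assumption; the current published rates and units would need to be substituted.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600
MB_PER_GB = 1024  # assumption: the reported "MB" figures are binary megabytes


@dataclass(frozen=True)
class Rates:
    """Placeholder hourly prices; fill in the currently published values."""
    vcpu_per_hour: float
    ram_gb_per_hour: float
    pd_gb_per_hour: float


def estimate_job_cost(vcpu_sec: float, ram_mb_sec: float,
                      pd_mb_sec: float, rates: Rates) -> float:
    """Convert the job's resource counters to hours and apply the rates."""
    vcpu_hours = vcpu_sec / SECONDS_PER_HOUR
    ram_gb_hours = ram_mb_sec / MB_PER_GB / SECONDS_PER_HOUR
    pd_gb_hours = pd_mb_sec / MB_PER_GB / SECONDS_PER_HOUR
    return (vcpu_hours * rates.vcpu_per_hour
            + ram_gb_hours * rates.ram_gb_per_hour
            + pd_gb_hours * rates.pd_gb_per_hour)
```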

                4 votes
                0 comments
                • Scala-friendly API / library

                  The current darkjh/scalaflow library is pretty basic, and DoFn etc. are pretty messy to use. It would be nice if Scala were natively supported.

                  4 votes
                  1 comment
                    • API for Go

                    It would be nice if Go were natively supported.

                    4 votes
                    0 comments
                    • Add the ability to sort jobs by status (e.g. running vs closed)

                      I would like to be able to quickly see the number of jobs that are currently running. Sometimes streaming jobs that have been running for weeks get buried below batch or testing jobs.

                      4 votes
                      0 comments
                      • Example Code is Maddeningly Incomplete

                        The example code provided is maddeningly incomplete. The biggest issue I have is that things like the complete code for the default templates are not provided. I want to create a slight variation of the Pub/Sub -> BigQuery example template, but I can't find the code for that template anywhere. It would be nice if that code were available so that I could base a custom Dataflow job on it. This would provide a known working example of exactly what I want to build on.

                        4 votes
                        0 comments
                        • Improve cost effectiveness of autoscaling algorithm

                          I'm currently processing log data from multiple days with Cloud Dataflow. According to the defined options, it uses 10 to 100 workers and the throughput-based autoscaling algorithm. At the moment there are still 64 workers active, while only one job is still running at around 1500 elements per second. If you look at the CPU graph of the workers, you see that almost all of them have been idle for the last 30 minutes. I would prefer a more carefree autoscaling, where I know I always get the optimal cost effectiveness.

                          4 votes
                          1 comment

                            Hi Max,

                            We’ve done a few performance optimizations lately that should result in a much improved experience. Could you share a jobID for us to take a look at? (I’m curious to examine the experience you describe).

                            Thanks!

                          • Link directly to logs

                            It would be nice to have a direct link to the logs of the job from the overview page.

                            At the moment, you have to:

                            - Click the job
                            - Wait for it to load
                            - Click "Logs"
                            - Click "Worker Logs"
                            - Wait for it to load

                            It should be just one click to get the logs :-)

                            4 votes
                            0 comments
                            • Ability to run dataflow pipeline from deployed flex python app-engine service without gcloud

                              I have defined a batch Python pipeline inside a flex App Engine Python service.

                              Once deployed on GCP, the pipeline cannot be compiled and started.
                              The current workaround is to install gcloud-sdk in the service's Dockerfile.

                              It would be great not to have to install gcloud-sdk on the deployed service.
                              It would also be great to have documentation on best practices for running a Python pipeline from a deployed service.

                              3 votes
                              0 comments
                              • Add Dataflow Job Logs to Cloud Logging API

                                Dataflow Job Logs are separate from Cloud Logging, so you cannot see Job Logs under Cloud Logging, nor create a Stackdriver alert for failed Dataflow jobs.

                                https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline

                                3 votes
                                0 comments
                                • Log BigQuery Error

                                  Currently, when BigQueryIO.Write tries to insert something invalid in streaming mode, the log contains only the call stack and not the reason for the error (such as a wrong format or a wrong field name):
                                  exception: "java.lang.IllegalArgumentException: timeout value is negative
                                  at java.lang.Thread.sleep(Native Method)
                                  at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:287)
                                  at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2446)
                                  at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2404)
                                  at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
                                  at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
                                  at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
                                  at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:65)
                                  at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:80)
                                  at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:696)
                                  at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:94)
                                  at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:521)
                                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                                  at java.lang.Thread.run(Thread.java:745)
                                  "
                                  logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"
                                  stage: "F17"
                                  job: "2016-10-06_15_14_44-8205894049986167886"

                                  3 votes
                                  0 comments
                                  • 3 votes
                                    0 comments
                                    • Show avg/median per-element processing time in monitoring interface

                                      When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.

                                      It would be nice to be able to see the per-element processing time: either a simple average or better yet, a histogram. This would allow much easier performance monitoring.
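
As an illustration of the two views requested here, the sketch below computes a simple average from the totals the interface already shows, and a bucketed histogram from individual timings. The function names and bucket bounds are hypothetical, and per-element timings are assumed to be available, which the current interface does not expose.

```python
from bisect import bisect_right
from typing import List, Sequence


def average_per_element(total_seconds: float, element_count: int) -> float:
    """Simple average: total execution time divided by elements processed."""
    if element_count <= 0:
        raise ValueError("element_count must be positive")
    return total_seconds / element_count


def histogram(durations_ms: Sequence[float],
              bounds_ms: Sequence[float]) -> List[int]:
    """Count durations per bucket.

    bounds_ms must be sorted ascending. Bucket i holds durations d with
    bounds_ms[i-1] <= d < bounds_ms[i]; the last bucket holds everything
    at or above the final bound.
    """
    counts = [0] * (len(bounds_ms) + 1)
    for duration in durations_ms:
        counts[bisect_right(bounds_ms, duration)] += 1
    return counts
```

A histogram shows slow outliers that an average hides, which is why the idea prefers it for performance monitoring.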

                                      3 votes
                                      0 comments
                                      • Labeling Dataflow jobs

                                        Should be able to assign labels to Dataflow jobs and filter by label on the overview page.

                                        2 votes
                                        0 comments
                                        • Display execution parameters on Job info screen

                                          On the main info screen for a particular job, a tab for execution parameters would be very useful for debugging and quantifying job performance.

                                          Pretty much the whole suite of:

                                          --input
                                          --stagingLocation
                                          --dataflowJobFile
                                          --maxNumWorkers
                                          --numWorkers
                                          --diskSizeGb
                                          --jobName
                                          --autoscalingAlgorithm

                                          that dataflow supports as execution parameters would be great to have to the right of "Step" on a tab called "Job".
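
A rough sketch of how such a tab could be filled: collect the `--name=value` arguments a job was launched with and render them as sorted rows. The helper names `parse_execution_parameters` and `format_job_tab` are hypothetical, and the sketch assumes every parameter is passed in `--name=value` form.

```python
from typing import Dict, List


def parse_execution_parameters(argv: List[str]) -> Dict[str, str]:
    """Collect --name=value arguments; anything else is ignored."""
    params: Dict[str, str] = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            params[name] = value
    return params


def format_job_tab(params: Dict[str, str]) -> str:
    """Render the parameters as 'name: value' rows, sorted by name."""
    return "\n".join(f"{name}: {params[name]}" for name in sorted(params))
```

For example, `format_job_tab(parse_execution_parameters(["--numWorkers=10", "--jobName=etl"]))` yields one row per parameter, ready to show next to "Step".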

                                          2 votes
                                          0 comments