Cloud Dataflow

Welcome to the Google Cloud Dataflow idea forum. You can submit and vote on ideas here to tell the Google Cloud Dataflow team which features you’d like to see.

This forum is for feature suggestions. If you're looking for help with an existing feature, please use the product's help forums instead.

We can’t wait to hear from you!

  1. Allow custom logger appenders

    Using a custom log appender (e.g. with logback) inside Dataflow is currently impossible. Any logging configuration I provide seems to be superseded by Google's own appender, and my logs just show up in the Dataflow logs in Stackdriver. I want to send my logs to an Elasticsearch cluster, since the logs produced by my other, non-Dataflow systems already live there.
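
    For illustration, a minimal sketch (Java, logback-classic) of the kind of setup that currently gets overridden on the workers; "ElasticsearchAppender" is a hypothetical appender class standing in for whatever implementation ships the logs:

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import org.slf4j.LoggerFactory;

    public class LoggingSetup {
      // Attach a custom appender to the root logger at worker startup. On Dataflow
      // this configuration is currently superseded by Google's own Stackdriver appender.
      public static void attachElasticsearchAppender() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        Logger root = context.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);

        ElasticsearchAppender appender = new ElasticsearchAppender();  // hypothetical appender class
        appender.setContext(context);
        appender.start();
        root.addAppender(appender);
      }
    }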

    1 vote · 0 comments
  2. Add a more helpful error message to "projects.jobs.create"

    I'm trying to launch a batch pipeline from outside the project and the "projects.jobs.create" API is returning:

    {
      "error": {
        "code": 400,
        "message": "Request contains an invalid argument.",
        "status": "INVALID_ARGUMENT"
      }
    }

    There is no indication of which argument is invalid. This wouldn't be such a big deal, except that the documentation for this call (https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/create?authuser=1) does not indicate which fields are required. :(

    3 votes · 0 comments
  3. Wait.on() PDone instead of PCollection

    I would like to be able to publish an event to PubSub after writing data to BigQuery. The Wait.on() transform is intended for exactly this situation; however, Wait.on() requires a PCollection as the input to wait on, while BigQueryIO returns a PDone. As such, I would like to be able to call Wait.on() on a PDone before applying a subsequent transform.
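
    A rough sketch of the requested usage, taking the post's description that the BigQuery sink terminates in a PDone; the Wait.on(PDone) call is the hypothetical part, and rows, events, tableSpec, and topic are placeholder names:

    // Write rows to BigQuery; today this branch ends the pipeline graph rather than
    // producing a PCollection that Wait.on() could consume.
    PDone done = rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows().to(tableSpec));

    // Requested: only publish to Pub/Sub once the BigQuery write has completed.
    events
        .apply(Wait.on(done))  // hypothetical; Wait.on() requires a PCollection today
        .apply("PublishDoneEvent", PubsubIO.writeMessages().to(topic));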

    7 votes · 0 comments
  4. delete or cancel job using the console

    Delete or cancel a job using the console.

    14 votes · 2 comments
  5. It would be great if the Dataflow SDK worked with a newer version of Google Datastore than v1

    Whenever I want to use Dataflow together with Google Datastore, I have to use the old Datastore v1 API. With this version I have to encode and decode entities of each kind manually, by extracting the Values (and knowing each value's type) and setting them on a new object. Compared with newer versions of Datastore or the Node.js implementation, where handling Datastore objects is a dream (Node.js just gives you the JSON representation), this is cumbersome. Would it be possible to retrieve entities by "selecting" a class type, like:
    MyObject mObj = entity.to(MyObject.class)
    or what would…
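
    For contrast, a minimal sketch of the manual extraction that the Datastore v1 entity model currently requires, versus the kind of mapping being requested; MyObject and the property names are placeholders:

    import com.google.datastore.v1.Entity;

    // Today: pull each property out of the protobuf Entity by hand, knowing its type.
    static MyObject fromEntity(Entity entity) {
      String name = entity.getPropertiesMap().get("name").getStringValue();
      long age = entity.getPropertiesMap().get("age").getIntegerValue();
      return new MyObject(name, age);
    }

    // Requested: let the SDK do the mapping, as other Datastore clients do:
    // MyObject obj = entity.to(MyObject.class);   // hypothetical API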

    1 vote · 0 comments
  6. Painful and sometimes impossible to change pipeline options when using templates

    Not all parameters can be configured at runtime, and the documentation should tell users which ones can.

    For example, AWS credentials and regions are configured at template construction time. There is no way (at least that I have found) to change them at run-time, e.g. to pass in credentials when the template is launched.

    Another case: in order to accept a dynamic BigQuery query at run-time, extra code is needed to override the default query.
    What confuses me is that a query/table has to be provided at template construction time, otherwise BigQueryIO complains.
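
    For the BigQuery case, a minimal sketch of the extra code currently needed to make the query a runtime parameter of a template, using ValueProvider; the option and transform names are illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.ValueProvider;

    public interface MyTemplateOptions extends PipelineOptions {
      @Description("BigQuery query to run, supplied when the template is launched")
      ValueProvider<String> getQuery();
      void setQuery(ValueProvider<String> value);
    }

    MyTemplateOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(MyTemplateOptions.class);
    Pipeline pipeline = Pipeline.create(options);

    // Defer reading the value until run-time and skip construction-time validation,
    // otherwise BigQueryIO complains about the missing query while the template is built.
    pipeline.apply("ReadFromBigQuery",
        BigQueryIO.readTableRows()
            .fromQuery(options.getQuery())
            .usingStandardSql()
            .withoutValidation());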

    1 vote · 0 comments
  7. gradle

    I would love an example Gradle project. Gradle is extremely popular in the Java community, so it is odd that there is only documentation on how to use Dataflow with Maven.

    3 votes · 0 comments
  8. Beam SDK for Python: support for Triggers

    The Beam Python SDK doesn't support triggers (or state, or timers).
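
    For context, a sketch of the kind of trigger configuration the Java SDK already offers and which has no Python equivalent yet; "input" is a placeholder keyed PCollection, and the window size and delays are arbitrary:

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    // Emit speculative results every 30 seconds of processing time within
    // five-minute fixed windows, allowing one minute of late data.
    input.apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .withAllowedLateness(Duration.standardMinutes(1))
        .discardingFiredPanes());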

    3 votes · 0 comments
  9. Add numWorkers and autoscalingAlgorithm parameters to Google-provided templates

    I'm using the Google-provided "Datastore to GCS Text" template, but under high load (~8M entities) Dataflow takes too long to scale up. If we could pass numWorkers and autoscalingAlgorithm parameters, jobs would take less time to execute.
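
    For comparison, these knobs are already exposed as pipeline options when building a pipeline yourself; the request is to surface the same parameters on the Google-provided templates. A sketch, with arbitrary values and "options" standing in for the pipeline's PipelineOptions:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;

    DataflowPipelineWorkerPoolOptions workerOptions =
        options.as(DataflowPipelineWorkerPoolOptions.class);

    // Start with more workers so a large backlog is picked up immediately,
    // and keep throughput-based autoscaling enabled.
    workerOptions.setNumWorkers(20);
    workerOptions.setMaxNumWorkers(100);
    workerOptions.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);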

    1 vote · 0 comments
  10. Bug: Apache Beam Dataflow runner throwing setup error

    Hi,
    We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:

    A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

    However, we could not find any detailed worker-startup logs.

    We tried increasing the memory size, worker count, etc., but still get the same error.

    Here is the command we use:

    python run.py \
        --project=xyz \
        --runner=DataflowRunner \
        --staging_location=gs://xyz/staging \
        --temp_location=gs://xyz/temp \
        --requirements_file=requirements.txt \
        --worker_machine_type n1-standard-8 \
        --num_workers 2

    Pipeline snippet (truncated):

    data = pipeline | "load data" >> beam.io.Read(
        beam.io.BigQuerySource(query="SELECT *
    16 votes · 0 comments
  11. 1 vote · 0 comments
  12. BigQuery to PubSub Dataflow Template

    I need a BigQuery to PubSub template in Dataflow. This would allow us to create a streaming job that exports BigQuery Gmail logs into our SIEM via our PubSub HTTPS endpoint.
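
    In the absence of a template, a rough sketch of the pipeline shape in the Java SDK (batch rather than streaming, since BigQueryIO reads are bounded); the query, topic, and the formatAsJson helper are placeholders:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    pipeline
        .apply("ReadFromBigQuery",
            BigQueryIO.readTableRows()
                .fromQuery("SELECT * FROM `my_project.gmail_logs.events`")  // placeholder query
                .usingStandardSql())
        .apply("FormatAsJson",
            MapElements.into(TypeDescriptors.strings())
                .via((TableRow row) -> formatAsJson(row)))  // hypothetical serializer for the SIEM
        .apply("PublishToPubSub",
            PubsubIO.writeStrings().to("projects/my_project/topics/siem-export"));  // placeholder topic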

    67 votes · 2 comments
  13. Bug: Dataflow jobs are not shown consistently on cloud console

    Currently running jobs are sometimes not shown in the Cloud Console. After refreshing the page they sometimes show up, only to disappear again a couple of seconds later.

    The behaviour is very inconsistent, so I have not found a way to reproduce the issue; it seems mostly time dependent. I first noticed it a couple of weeks ago. Today I've been suffering from it the whole day, while yesterday everything worked fine.

    20 votes · 4 comments
  14. dataprep / Enable deleting more than one dataset

    In the DATASETS tab in Dataprep, it would be extremely helpful to be able to select multiple datasets and delete them together.

    1 vote · 0 comments
  15. 50 votes · 0 comments
  16. dataprep / Drop columns by a specific rule

    Today, in order to remove columns in Dataprep I have to delete each column manually. When dealing with raw datasets I often find myself deleting hundreds of columns because they add little value (e.g. they are 90% empty). I want to be able to define a rule that deletes all columns matching a logical expression (e.g. delete all columns that are more than 90% empty).

    2 votes · 0 comments
  17. dataprep / Cause a job to fail if there is an error in the recipe

    TL;DR - I want to be notified when a recipe becomes invalid due to errors in the data source or recipe (which can happen after a change in the source tables or for any other reason), and to see that the transformation has failed.

    Example scenario - I have a dataset (e.g. a Salesforce table) that is extracted on a daily basis and saved to GCS. A scheduled Dataprep job is invoked every day on the new table and transforms it with a pre-defined recipe. When a column is added to or removed from the data source, the recipe might…

    12 votes · 0 comments
  18. PubSub to BigQuery template should allow Avro and/or Protobuf

    BigQuery already supports loading files in Avro, so why not stream them in from PubSub? This seems like an obvious feature to have, but I don't see any way to do it currently. The PubSub to BigQuery template is great, but it would be so much better with this one feature turned on.
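
    Until a template supports it, a hedged sketch of decoding Avro payloads from Pub/Sub in a custom Java pipeline; the Avro schema and the later conversion of records into BigQuery rows are assumptions left out here:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.transforms.DoFn;

    // Decode each Pub/Sub payload as an Avro record using a known schema (passed in
    // as a JSON string so the DoFn stays serializable); turning the GenericRecord
    // into a TableRow for BigQueryIO is omitted.
    class DecodeAvroFn extends DoFn<PubsubMessage, GenericRecord> {
      private final String schemaJson;

      DecodeAvroFn(String schemaJson) { this.schemaJson = schemaJson; }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        Schema schema = new Schema.Parser().parse(schemaJson);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        GenericRecord record = reader.read(null,
            DecoderFactory.get().binaryDecoder(c.element().getPayload(), null));
        c.output(record);
      }
    }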

    11 votes · 0 comments
  19. dataprep / dataflow jobs region

    TL;DR: Dataprep creates Dataflow jobs in the US only. This is probably a bug.

    I'm trying to prep data from BigQuery (EU) to BigQuery (EU).

    But Dataprep creates the Dataflow job in the US, and because of that I get an error:

    Cannot read and write in different locations: source: EU, destination: US, error: Cannot read and write in different locations: source: EU, destination: US

    15 votes · 2 comments
  20. Stop a streaming pipeline when idle to save costs

    Reading from PubSub for a fixed amount of time was possible in SDK 1.9.1 with maxReadTime(Duration), but in 2.x we no longer have that option.

    I know that using PubSub as a bounded collection is bad design, but sometimes we have scenarios where only a few messages are pushed once a day, and it is impractical to leave a streaming job running in those cases.

    Like Dataproc does with scheduled deletion (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion), it would be nice to have something similar in Dataflow to save money.

    4 votes · 0 comments
