Currently, when BigQueryIO.Write tries to insert something invalid in streaming mode, the log contains only the call stack, not the reason for the error (such as a wrong format or a wrong field name):
exception: "java.lang.IllegalArgumentException: timeout value is negative
at java.lang.Thread.sleep(Native Method)
job: "2016-10-061514_44-8205894049986167886"3 votes
It's been a long time since the last info about running Beam on Dataflow: https://stackoverflow.com/questions/38534686/using-beam-sdk-in-cloud-dataflow
3 votes
Beam fails to stage a Dataflow template with Python 3. It looks like Beam is trying to access the RuntimeValueProvider during staging, causing a 'not accessible' error.
The template stages fine with Python 2.7.
Repo with code to reproduce the issue and stack trace: https://github.com/firemuzzy/dataflow-templates-bug-python3
3 votes
When running a Dataflow driver program behind a firewall, it needs to use a proxy to connect to GCP, but there does not seem to be a way to specify that HTTPS traffic should go through a proxy.
3 votes
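One possible workaround (an assumption, not a documented Dataflow feature): the driver program talks to GCP over HTTPS through the standard Java HTTP stack, which honors the JVM proxy system properties. A minimal sketch with a hypothetical proxy host and port:

    public static void main(String[] args) {
      // Route the driver's outbound HTTPS traffic through the corporate proxy.
      // Must be set before the pipeline client makes any GCP calls.
      System.setProperty("https.proxyHost", "proxy.internal.example.com");
      System.setProperty("https.proxyPort", "3128");
      // ... construct and run the pipeline as usual ...
    }

The same properties can also be passed on the command line, e.g. -Dhttps.proxyHost=... -Dhttps.proxyPort=..., instead of setting them in code.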
When selecting a transform in the Dataflow monitoring interface, you can currently see the number of elements that have been processed as well as the total execution time.
It would be nice to be able to see the per-element processing time: either a simple average or better yet, a histogram. This would allow much easier performance monitoring.
3 votes
Support MapState from Beam API
Currently the DataflowRunner does not support MapState for stateful DoFns.
3 votes
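For reference, this is the kind of stateful DoFn the report is about; a minimal sketch, assuming the Beam Java state API (the counting logic is illustrative):

    import org.apache.beam.sdk.state.MapState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // A stateful DoFn declaring MapState; per the report above, this is the
    // construct the DataflowRunner rejects.
    class CountPerKeyFn extends DoFn<KV<String, String>, KV<String, Long>> {
      @StateId("counts")
      private final StateSpec<MapState<String, Long>> countsSpec = StateSpecs.map();

      @ProcessElement
      public void process(ProcessContext c,
                          @StateId("counts") MapState<String, Long> counts) {
        String field = c.element().getValue();
        Long current = counts.get(field).read();  // read() returns null for absent keys
        long updated = (current == null ? 0L : current) + 1;
        counts.put(field, updated);
        c.output(KV.of(field, updated));
      }
    }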
I would love an example Gradle project. Gradle is of course super popular in the Java community, so it is very odd that there is only documentation on how to use Dataflow with Maven.
3 votes
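A minimal build.gradle sketch of what such an example might contain (the version placeholder and main class are illustrative, not from any official example):

    plugins {
        id 'java'
        id 'application'
    }

    repositories {
        mavenCentral()
    }

    dependencies {
        // Beam SDK plus the Dataflow runner; pin to whatever Beam release you target.
        implementation 'org.apache.beam:beam-sdks-java-core:2.XX.0'
        implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.XX.0'
    }

    application {
        mainClass = 'com.example.MyPipeline'
    }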
On the main info screen for a particular job, a tab for execution parameters would be very useful for debugging and quantifying job performance.
Pretty much the whole suite of options that Dataflow supports as execution parameters would be great to have, to the right of "Step", on a tab called "Job".
3 votes
Thanks for the suggestion!
I want to dump data read from a large CSV file in a bucket into Postgres. Could you provide code for this?
2 votes
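A minimal sketch of one way to do this with the Beam Java SDK and JdbcIO (the bucket path, connection settings, table, and column layout are all hypothetical, and the line splitting is naive):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;

    Pipeline p = Pipeline.create(options);
    p.apply("ReadCsv", TextIO.read().from("gs://my-bucket/data/*.csv"))
     .apply("WriteToPostgres", JdbcIO.<String>write()
         .withDataSourceConfiguration(
             JdbcIO.DataSourceConfiguration.create(
                 "org.postgresql.Driver", "jdbc:postgresql://db-host:5432/mydb")
                 .withUsername("user")
                 .withPassword("password"))
         .withStatement("INSERT INTO my_table (col_a, col_b) VALUES (?, ?)")
         .withPreparedStatementSetter((line, stmt) -> {
           // Naive split; a real pipeline should use a proper CSV parser
           // to handle quoting and embedded commas.
           String[] f = line.split(",");
           stmt.setString(1, f[0]);
           stmt.setString(2, f[1]);
         }));
    p.run();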
As of today, when you pin a product (e.g. Google BigQuery) on the console drawer, it is pinned independently of the selected Google project. Since different projects may use different sets of products, it would be nice if the pinned products were scoped by project.
2 votes
I want to be able to order the list of past Dataflow jobs by start time, end time, etc. in the web UI.
1 vote
I would like to take a valid job and generate at least some of the code needed to re-create it in Python using the client API. I want to be able to do this for my historic jobs.
1 vote
I'm using "Datastore to GCS text" provided Template but with high load (~8M entities) dataflow takes too long to scale. If we can provide numWorkers and autoscalingAlgorithm parameters jobs will take less time to execute.1 vote
python pipeline --runner DataflowRunner ... --requirements_file requirements.txt
throws an error if having
in requirements.txt but not if
Part of error:
ModuleNotFoundError: No module named 'Cython'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/h4/n9rzy8z52lqdh7sfkhr96nnw0000gn/T/pip-download-lx28dwpv/pyarrow/
Where to report bugs?
1 vote
I have to read a file from a bucket, dump that file's data into a BigQuery table, and then run 200 queries against that table. The whole process is sequential, but because the work runs in parallel it does not complete correctly, so I need to synchronize it such that one job finishes before the next is triggered.
Can anyone help me?
1 vote
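Within a single Beam pipeline, one way to express this ordering is the Wait transform; a minimal sketch, assuming the Beam Java SDK (the PCollections and RunQueriesFn are hypothetical):

    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Wait;
    import org.apache.beam.sdk.values.PCollection;

    // loadResult: output of the step that loads the file into BigQuery.
    // queries: the 200 queries to run once the load has finished.
    // Wait.on emits the main input's elements only after the corresponding
    // window of the signal collection (loadResult) is complete.
    PCollection<String> gated = queries.apply("WaitForLoad", Wait.on(loadResult));
    gated.apply("RunQueries", ParDo.of(new RunQueriesFn()));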
Please update the docs to describe how machine type affects jobs.
If you have a serial pipeline and don't do any native threading in your DoFn, is an n1-standard-8 going to be any faster than an n1-standard-1?
If you have parallel stages and set a max of 50 workers, will you get work done faster on an n1-standard-8 than an n1-standard-1? I.e., will it use 400 cores for workers instead of 50?
[Please ignore, for this discussion, that the n1-standard-8 has more RAM and may help groupBy.]
1 vote
The screenshots are just old and outdated, so they cannot be used for troubleshooting. Permissions is not on the left, for example.
The error log messages for permission failures do not say which account is having access issues, so I have spent hours trying different random combinations and still have failed to get even the most basic Dataflow examples to work.
1 vote
Not all parameters can be configured at runtime. The docs should provide that information to users.
For example, AWS credentials/regions are configured at template construction time. There is no way (at least for me) to change them at run time if I want to pass in credentials then.
In another case, extra code is needed to override the default query in order to accept dynamic BigQuery queries at run time.
What confuses me is that a query/table has to be provided during template construction, otherwise BigQueryIO complains.
1 vote
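For the BigQuery case, a minimal sketch of the extra code mentioned above, assuming the Beam Java SDK (the interface and option names are hypothetical): a ValueProvider option combined with withoutValidation() defers the query to run time.

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.ValueProvider;

    public interface TemplateOptions extends PipelineOptions {
      @Description("BigQuery query, resolved at run time")
      ValueProvider<String> getQuery();
      void setQuery(ValueProvider<String> value);
    }

    // withoutValidation() skips the construction-time check that otherwise
    // requires a concrete query/table while the template is being built.
    p.apply("ReadFromBigQuery", BigQueryIO.readTableRows()
        .fromQuery(options.getQuery())
        .withoutValidation()
        .withTemplateCompatibility());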
How to read a file from GCS and load it into BigQuery, with Python?
1 vote