Multiplex between Iterators in TensorFlow
What I need, functionally: My dataset is partitioned into blocks, and each block sits in a binary file.
I have an algorithm that also operates on blocks to reduce computational complexity, and then merges the results together after visiting all blocks. It's important that a single minibatch of data originates from a single block, and that I know exactly which block, so I can pass block-specific parameters into the graph. On the next iteration, when starting again at block 0, the next minibatch from every block should be used. Blocks can have unequal lengths and should repeat forever.
My current solution: Currently, I create a tf.data.Iterator per block (i.e. per file), built from a tf.data.FixedLengthRecordDataset:
# for every file:
ds = tf.data.FixedLengthRecordDataset(...)   # one dataset per block-file
ds = ds.repeat()                             # blocks repeat forever
ds = ds.batch(...)
ds = ds.map(...)
ds = ds.prefetch(buffer_size=1)
it = ds.make_one_shot_iterator()             # this is the slow step (see below)
Then I have an "master" iterator that multiplexes between the file-level iterators. This is done through:
itr_handle = tf.placeholder(tf.string, shape=())
master_itr = tf.data.Iterator.from_string_handle(itr_handle, output_types)
master_next = master_itr.get_next()
So each time the graph is executed, I feed the placeholder the string handle of the iterator I want to use for that execution. This way every file-level iterator keeps its own state; when the same block-file is asked for its next minibatch, it actually returns the next minibatch, instead of reopening the file and simply returning the first minibatch again.
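For completeness, this is roughly how I drive it at run time (a minimal sketch; file_iterators here stands for the list of per-file iterators created above):
with tf.Session() as sess:
    # Evaluate every iterator's string handle once; they stay valid for the whole session.
    handles = [sess.run(it.string_handle()) for it in file_iterators]

    # One pass over all blocks: each run pulls that block's *next* minibatch,
    # because every file-level iterator keeps its own position.
    for block_idx, handle in enumerate(handles):
        minibatch = sess.run(master_next, feed_dict={itr_handle: handle})
        # ... apply the parameters specific to block block_idx and process minibatch ...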
The problem: Creating the file-level iterators is slow. It takes at least 200 ms to create an Iterator per file. The dataset I use can easily contain up to 100 block-files, so TensorFlow/Python spends around 20 seconds just building these Iterator objects and graph nodes before it processes any data.
Questions:
- Is there another approach to tackle this problem, for example with only one iterator?
- If not, how can I speed up Iterator creation?
python tensorflow
Looks like you can use the tf.data API, specifically interleave or parallel_interleave. You would create a single dataset, interleaving data according to the ordering you need, and then you batch it and have a single iterator iterate through it.
– kvish
Nov 13 '18 at 2:02
Do you need to explicitly control which block to pick a batch from on every graph run, or is this just something you added as part of your current solution?
– jdehesa
Nov 13 '18 at 11:53
@jdehesa: I want every block exactly once, then merge the results. Then I can repeat the whole process. While processing block by block, I need to know which block I'm on in order to configure the block-specific parameters in the algorithm.
– Martijn Courteaux
Nov 13 '18 at 12:06
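For what it's worth, here is a minimal, untested sketch of the single-dataset interleave approach suggested in the comments above. The filenames, record_bytes and batch_size values are placeholders, and tagging each minibatch with its block index via zip is just one possible way to know which block a minibatch came from:
import tensorflow as tf

# Placeholder values for illustration only.
filenames = ["block_%03d.bin" % i for i in range(100)]
record_bytes = 1024
batch_size = 32

def block_dataset(block_idx, filename):
    # Infinite, batched stream of minibatches from one block-file.
    records = tf.data.FixedLengthRecordDataset(filename, record_bytes)
    records = records.repeat().batch(batch_size)
    # Tag every minibatch with the index of the block it came from,
    # so block-specific parameters can be selected inside the graph.
    ids = tf.data.Dataset.from_tensors(block_idx).repeat()
    return tf.data.Dataset.zip((ids, records))

blocks = tf.data.Dataset.from_tensor_slices(
    (tf.range(len(filenames), dtype=tf.int64), tf.constant(filenames)))

# cycle_length equal to the number of files with block_length=1 takes one
# minibatch from block 0, then block 1, ..., block N-1, and then the *next*
# minibatch from block 0 again, which matches the visiting order described
# in the question.
ds = blocks.interleave(block_dataset,
                       cycle_length=len(filenames),
                       block_length=1)
ds = ds.prefetch(1)

it = ds.make_one_shot_iterator()
block_id, minibatch = it.get_next()  # block_id says which block this minibatch came from
This needs only one graph-level iterator, so the per-file Iterator creation cost from the question should not apply in the same way; whether it is actually faster would have to be measured.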