Multiplex between Iterators in TensorFlow










What I need, functionally: My dataset is partitioned into blocks, and each block sits in a binary file.
I have an algorithm that also operates on blocks to reduce computational complexity and then merges the results together after visiting all blocks. It is important that a single minibatch of data originates from a single block, and that I know exactly which block it is, so I can pass some block-specific parameters into the graph. On the next pass, when starting again at block 0, the next minibatch of every block should be used. Blocks can have unequal lengths and should repeat forever.



My current solution: I create a tf.data.Iterator per block (i.e. per file), built from a tf.data.FixedLengthRecordDataset:



# for every file:
ds = tf.data.FixedLengthRecordDataset(...)
ds = ds.repeat()                 # each block repeats forever
ds = ds.batch(...)
ds = ds.map(...)
ds = ds.prefetch(buffer_size=1)

it = ds.make_one_shot_iterator()
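
The string handle of each of these file-level iterators is fetched once up front, presumably along these lines (a minimal sketch assuming the TF 1.x session API; build_block_iterator, block_files and sess are placeholder names, not part of the question's code):

# Hypothetical: one iterator per block file, plus its string handle value.
iterators = [build_block_iterator(f) for f in block_files]    # the pipeline above, per file
handles = [sess.run(it.string_handle()) for it in iterators]  # evaluated once, reused later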


Then I have a "master" iterator that multiplexes between the file-level iterators. This is done through:



itr_handle = tf.placeholder(tf.string, shape=())
master_itr = tf.data.Iterator.from_string_handle(itr_handle, output_types)
master_next = master_itr.get_next()


So each time the graph is executed, I feed the placeholder the string handle of the iterator I want to use for that execution. This way every file-level iterator keeps its own state, so when the same block-file is asked for its next minibatch, it actually returns the next minibatch instead of reopening the file and simply returning the first minibatch again.
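
A driver loop along these lines would match that description (a sketch under the same TF 1.x assumptions; handles is the hypothetical list of string handles collected above):

# Hypothetical driver loop: one sess.run per minibatch, cycling over the blocks forever.
while True:
    for block_idx, handle in enumerate(handles):
        batch = sess.run(master_next, feed_dict={itr_handle: handle})
        # block_idx identifies the block, so block-specific parameters can be applied here.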



The problem: Creating the file-level iterators is slow. It takes at least 200 ms to create an Iterator per file, and the dataset I use can easily contain up to 100 block-files, so TensorFlow/Python sits there building Iterator objects and graph nodes for 20 seconds without actually processing any data.



Question:



  1. Is there another approach to tackle this problem, for example with only one iterator?

  2. If not, how can Iterator creation be sped up?









python tensorflow

asked Nov 12 '18 at 23:21
Martijn Courteaux

  • Looks like you can use the tf.data API, specifically interleave or parallel_interleave. You would create a single dataset, interleaving data according to the ordering you need, and then you batch it and have a single iterator go through it (see the sketch after these comments).
    – kvish
    Nov 13 '18 at 2:02

  • Do you need to explicitly control which block to pick a batch from on every graph run, or is this just something you added as part of your current solution?
    – jdehesa
    Nov 13 '18 at 11:53

  • @jdehesa: I want every block exactly once, then merge the results. Then I can repeat the whole process. While processing block by block, I need to know which block it is, to configure the parameters in the algorithm specific to that block.
    – Martijn Courteaux
    Nov 13 '18 at 12:06
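
A minimal sketch of the single-dataset interleave approach kvish suggests, assuming TF 1.x (the API used in the question); filenames, record_bytes and batch_size are placeholders, and tagging each batch with its block index is one way to still know which block a minibatch came from:

import tensorflow as tf

filenames = ["block_0.bin", "block_1.bin"]   # hypothetical block files
record_bytes = 128                           # hypothetical fixed record size
batch_size = 32                              # hypothetical batch size

def block_dataset(index, filename):
    # One infinitely repeating, batched dataset per block, tagged with its block index.
    ds = tf.data.FixedLengthRecordDataset(filename, record_bytes)
    ds = ds.repeat().batch(batch_size)
    # Decode / reshape the raw records here, as in the original ds.map(...).
    return ds.map(lambda batch: (index, batch))

files = tf.data.Dataset.from_tensor_slices(
    (tf.range(len(filenames), dtype=tf.int64), tf.constant(filenames)))

# With cycle_length equal to the number of files and block_length=1, interleave takes one
# batch from block 0, one from block 1, ..., then the next batch from block 0, and so on
# forever; each block keeps its own read position because the inner datasets never restart.
ds = files.interleave(block_dataset, cycle_length=len(filenames), block_length=1)
ds = ds.prefetch(1)

it = ds.make_one_shot_iterator()
block_index, batch = it.get_next()   # block_index says which block this minibatch is from

This keeps a single iterator (one set of graph nodes instead of ~100), and the round-robin order plus per-block read positions match the hand-rolled multiplexing above; the trade-off is that the block order is fixed inside the pipeline rather than chosen per graph run.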














