Simple way to do a weighted hot deck imputation in Stata?

























I'd like to do a simple weighted hot deck imputation in Stata. In SAS the equivalent command would be the following (and note that this is a newer SAS feature, beginning with SAS/STAT 14.1 in 2015 or so):



proc surveyimpute method=hotdeck(selection=weighted); 


For clarity then, the basic requirements are:



  1. Imputations must be row-based or simultaneous. If row 1 donates x to row 3, then it must also donate y.


  2. Must account for weights. A donor with weight=2 should be twice as likely to be selected as a donor with weight=1.


I'm assuming the missing data is rectangular. In other words, if the set of potentially missing variables consists of x and y then either both are missing or neither is missing. Here's some code to generate sample data.



global miss_vars "wealth income"
global weight "weight"

set obs 6
gen id = _n
gen type = id > 3
gen income = 5000 * _n
gen wealth = income * 4 + 500 * uniform()
gen weight = 1
replace weight = 4 if mod(id-1,3) == 0

// set income & wealth missing every 3 rows
gen impute = mod(_n,3) == 0
foreach v in $miss_vars {
    replace `v' = . if impute == 1
}



Data looks like this:



      id   type   income     wealth   weight   impute
 1.    1      0     5000   20188.03        4        0
 2.    2      0    10000   40288.81        1        0
 3.    3      0        .          .        1        1
 4.    4      1    20000   80350.85        4        0
 5.    5      1    25000   100378.8        1        0
 6.    6      1        .          .        1        1


So in other words, we need to randomly (with weighting) select a donor observation of the same type for each row with missing values and use that donor to fill in both the income and wealth values. In practical use the generation of the type variable is of course its own problem, but I'm keeping it very simple here to focus on the main issue.



For example, row 3 might look like either of the following after the hot deck, because it fills both income and wealth from row 1 or from row 2 (but it would never take income from row 1 and wealth from row 2):



 3.    3      0     5000   20188.03        1        1
 3.    3      0    10000   40288.81        1        1


Also, since row 1 has weight=4 and row 2 has weight=1, row 1 should be the donor 80% of the time and row 2 should be the donor 20% of the time.
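
To make requirement 2 concrete, here is a small illustrative check (not part of the problem itself) that computes each donor's implied selection probability as its weight divided by the total donor weight within its type cell:

// illustrative only: implied donor selection probabilities per type cell
egen double donor_total = total(weight) if impute == 0, by(type)
gen double donor_prob = weight / donor_total if impute == 0
list id type weight donor_prob if impute == 0, noobs
// expect .8 for ids 1 and 4 (weight 4 of 5) and .2 for ids 2 and 5 (weight 1 of 5)
drop donor_total donor_prob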










  • The community-contributed command hotdeck might do what you want. – Pearly Spencer, Nov 15 '18 at 16:50















sas stata imputation

asked Nov 15 '18 at 16:44 by JohnE (edited Nov 29 '18 at 21:23)

2 Answers


















Here's a concise and simple approach that should also be quite fast even for large datasets, since it only does a few sorts and nothing else that should be computationally expensive. Here's the code with minimal comments; further below is the same code with more extensive comments:



gen sort_order = uniform()

// save recipient rows to file, keep donors
preserve
keep if impute == 1
save recipients, replace
restore
keep if impute == 0

// prep donor cells
sort type sort_order
by type: gen weight_sum = sum($weight)
by type: gen impute_weight = $weight / weight_sum[_N]
by type: replace impute_weight = sum(impute_weight)
drop weight_sum

// bring back recipient rows and sort entire data set
append using recipients
replace sort_order = impute_weight if impute_weight != .
gsort type -sort_order

// replace missing values via a simple replace
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if impute == 1
}


// extra kludge step necessary to handle top rows
gsort type sort_order
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if `v' == .
}



This seems to work fine for the test example, but I haven't tested it on larger and more complicated cases. As noted in the question, I expect this should give the same results as the SAS method:



proc surveyimpute method=hotdeck(selection=weighted);


Note also that if you don't want to use weights, you could just set them to be a column of ones (e.g. gen weight = 1).
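
As a rough check on the weighting (just a sketch; weighted_hotdeck.do is a placeholder name for a do-file that starts with clear and contains the sample-data code from the question followed by the imputation code above), you can repeat the whole procedure many times and tabulate which donor row 3 ends up with:

// simulation sketch -- assumes weighted_hotdeck.do exists as described above
set seed 12345
tempname sim
postfile `sim' double donor_income using donor_check, replace
forvalues i = 1/1000 {
    run weighted_hotdeck.do
    quietly summarize income if id == 3
    post `sim' (r(mean))
}
postclose `sim'
use donor_check, clear
tabulate donor_income   // roughly 80% of runs should show 5000, 20% should show 10000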



And here is the same code, with more comments:



gen sort_order = uniform()

// split off and save the recipient rows
preserve
keep if impute == 1
save recipients, replace

// restore full dataset and keep only donor rows
restore
keep if impute == 0

// set up the donor rows. the key idea here is to set up such
// that each donor row represents a probability interval where
// the ordering of the intervals within a cell is random (based on
// the variable "sort_order") and the width of the interval is
// proportional to the weight
sort type sort_order
by type: gen weight_sum = sum($weight)
by type: gen impute_weight = $weight / weight_sum[_N]
by type: replace impute_weight = sum(impute_weight)
drop weight_sum

// append with recipients so we again have a full dataset
// with both donors and recipients
append using recipients

// now we intersperse the donors and recipients using "sort_order"
// which is based on randomness and weight for the donors and
// is purely random for the recipients
replace sort_order = impute_weight if impute_weight != .
gsort type -sort_order

// fill recipient variables from donor rows. conceptually
// this is very simple. each recipient row falls within the
// range of some donor cell. in practice, that is simply
// the nearest preceding donor cell
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if impute == 1
}


// however, there's a minor practical issue that recipient
// cells that are in the range of the first donor cell need
// to be filled by the nearest successive donor cell, which
// can be done by reversing the sort and then filling from
// the nearest preceding donor cell
gsort type sort_order
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if `v' == .
}






answered Nov 16 '18 at 18:14 by JohnE (edited Nov 19 '18 at 2:12)

    Here are some brief notes about the community-contributed hotdeck routines by Adrian Mander and David Clayton mentioned in the comments above by @PearlySpencer (plus a follow-up version):



    There seem to be a couple versions:



    • hotdeck.ado (2007) https://ideas.repec.org/c/boc/bocode/s366901.html

    • whotdeck.ado (2011) https://econpapers.repec.org/software/bocbocode/s433201.htm

    As best I can tell, both of these are designed to do an Approximate Bayesian Bootstrap, which is essentially a multiple-imputation version of a hot deck.
    Unfortunately, neither of them seems to handle sample (or survey) weights. The second of the two ("whotdeck") does have a parameter for weights, but this appears to be for predicting "missingness" and does not have anything to do with sample/survey weights.



    The first one ("hotdeck") does at least seem to do a standard hot deck, so it may be used that way if you don't need weights. The second one ("whotdeck") probably does a simple hot deck as well, but the syntax was a little trickier and I didn't succeed in getting it to do so (which is probably a failure on my part, and in any event not a knock on it, as it seems designed for more complex situations).



    I emailed Adrian Mander and he said he doesn't use Stack Overflow, but that it would be OK for me to post his email response to my question about using sample/survey weights with hotdeck or whotdeck:




    Interesting problem, if the weights are frequency weights then the easiest thing to do is expand freq_weight and then use hotdeck.



    It might be able to be done with a single line of code to make it work with other types of weight because currently the imputation is done by randomly ordering the rows of your dataset by generating a random number and then sorting.. with weights you would need to generate random numbers and then probably multiply the weights to the random numbers and then order them (I think this sort of thing would work but this idea has just popped into my head so would need some thinking about).
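
    For the frequency-weight case he describes, a rough sketch (untested; the hotdeck call below is only schematic, so check help hotdeck for the exact syntax and options) would be something like:

    // only valid if weight is an integer frequency weight
    ssc install hotdeck         // community-contributed command from SSC
    expand weight               // a weight==4 row becomes 4 identical rows
    hotdeck income wealth, by(type)   // schematic call -- see help hotdeck

    After the imputation you would presumably drop the extra donor copies again (e.g. keep one row per id), and of course this only mimics the weighted selection when the weights really are frequency weights.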







answered Nov 19 '18 at 19:30 by JohnE (edited Nov 21 '18 at 18:59)