Simple way to do a weighted hot deck imputation in Stata?
I'd like to do a simple weighted hot deck imputation in Stata. In SAS the equivalent command would be the following (and note that this is a newer SAS feature, beginning with SAS/STAT 14.1 in 2015 or so):
proc surveyimpute method=hotdeck(selection=weighted);
For clarity then, the basic requirements are:
- Imputations must be row-based or simultaneous. If row 1 donates x to row 3, then it must also donate y.
- Must account for weights. A donor with weight=2 should be twice as likely to be selected as a donor with weight=1.
I'm assuming the missing data is rectangular. In other words, if the set of potentially missing variables consists of x and y, then either both are missing or neither is missing. Here's some code to generate sample data.
global miss_vars "wealth income"
global weight "weight"
set obs 6
gen id = _n
gen type = id > 3
gen income = 5000 * _n
gen wealth = income * 4 + 500 * uniform()
gen weight = 1
replace weight = 4 if mod(id-1,3) == 0
// set income & wealth missing every 3 rows
gen impute = mod(_n,3) == 0
foreach v in $miss_vars {
    replace `v' = . if impute == 1
}
Data looks like this:
id type income wealth weight impute
1. 1 0 5000 20188.03 4 0
2. 2 0 10000 40288.81 1 0
3. 3 0 . . 1 1
4. 4 1 20000 80350.85 4 0
5. 5 1 25000 100378.8 1 0
6. 6 1 . . 1 1
So in other words, we need to randomly (with weighting) select a donor observation of the same type for each row with missing values and use that donor to fill in both income and wealth values. In practical use the generation of the type variable is of course its own problem, but I'm keeping that very simple here to focus on the main issue.
For example, row 3 might look like either of the following after the hot deck (it fills both income and wealth from row 1, or both from row 2, but would never take income from row 1 and wealth from row 2):
3. 3 0 5000 20188.03 1 1
3. 3 0 10000 40288.81 1 1
Also, since row 1 has weight=4 and row 2 has weight=1, row 1 should be the donor 80% of the time and row 2 should be the donor 20% of the time.
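To make the weighting requirement concrete, here's a small sketch run on the sample data above; the variables cell_w and p_donor are just illustrative helpers that compute each donor's expected selection share as its weight over the total donor weight in its type cell:
// expected donor shares implied by the weights (illustrative only)
egen double cell_w = total(weight * (impute == 0)), by(type)
gen double p_donor = weight / cell_w if impute == 0
list id type weight p_donor if impute == 0, noobs   // 0.8 for ids 1 and 4, 0.2 for ids 2 and 5
drop cell_w p_donor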
Comment: The community-contributed command hotdeck might do what you want. – Pearly Spencer, Nov 15 '18 at 16:50
2 Answers
Here's a concise and simple approach that should also be quite fast even for large datasets, as it only does two sorts and nothing else that should be computationally expensive. Here's the code with minimal comments; further below is the same code with more extensive comments:
gen sort_order = uniform()
// save recipient rows to file, keep donors
preserve
keep if impute == 1
save recipients, replace
restore
keep if impute == 0
// prep donor cells
sort type sort_order
by type: gen weight_sum = sum($weight)
by type: gen impute_weight = $weight / weight_sum[_N]
by type: replace impute_weight = sum(impute_weight)
drop weight_sum
// bring back recipient rows and sort entire data set
append using recipients
replace sort_order = impute_weight if impute_weight != .
gsort type -sort_order
// replace missing values via a simple replace
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if impute == 1
}
// extra kludge step necessary to handle top rows
gsort type sort_order
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if `v' == .
}
This seems to work fine for the test example, but I haven't tested it on larger and more complicated cases. As noted in the question, I expect this should give the same results as the SAS method:
proc surveyimpute method=hotdeck(selection=weighted);
Note also that if you don't want to use weights, you could just set them to be a column of ones (e.g. gen weight = 1).
And here is the same code, with more comments:
gen sort_order = uniform()
// split off and save the recipient rows
preserve
keep if impute == 1
save recipients, replace
// restore full dataset and keep only donor rows
restore
keep if impute == 0
// set up the donor rows. the key idea here is that each donor
// row represents a probability interval, where the ordering of
// the intervals within a cell is random (based on the variable
// "sort_order") and the width of each interval is proportional
// to the weight
sort type sort_order
by type: gen weight_sum = sum($weight)
by type: gen impute_weight = $weight / weight_sum[_N]
by type: replace impute_weight = sum(impute_weight)
drop weight_sum
// append the recipients so we again have a full dataset
// with both donors and recipients
append using recipients
// now we intersperse the donors and recipients using "sort_order"
// which is based on randomness and weight for the donors and
// is purely random for the recipients
replace sort_order = impute_weight if impute_weight != .
gsort type -sort_order
// fill recipient variables from donor rows. conceptually
// this is very simple: each recipient row is within the
// range of some donor cell; in practice, that is simply
// the nearest preceding donor cell
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if impute == 1
}
// however, there's a minor practical issue: recipient cells
// that fall in the range of the first donor cell need to be
// filled by the nearest following donor cell, which can be
// done by reversing the sort and then filling from the
// nearest preceding donor cell
gsort type sort_order
foreach v in $miss_vars {
    by type: replace `v' = `v'[_n-1] if `v' == .
}
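As a rough sanity check of the probability-interval idea, here is a minimal standalone sketch (note that it clears whatever data is in memory) which replays just the donor-selection step for one cell with weights 4 and 1, and confirms that the weight-4 donor is chosen roughly 80% of the time:
// verification sketch, separate from the method above: simulate the
// donor-interval logic for a single cell with donor weights 4 and 1
set seed 20181116
local picks4 0
local reps 2000
forvalues rep = 1/`reps' {
    quietly {
        clear
        set obs 2
        gen weight = cond(_n == 1, 4, 1)
        gen sort_order = runiform()
        sort sort_order
        // cumulative shares, normalised so the last one equals 1
        gen impute_weight = sum(weight) / 5
        // a recipient is a uniform draw; its donor is the first row
        // whose cumulative share reaches that draw
        local u = runiform()
        gen picked = impute_weight >= `u' & (_n == 1 | impute_weight[_n-1] < `u')
        count if picked & weight == 4
        if r(N) local picks4 = `picks4' + 1
    }
}
display "share of draws donated by the weight-4 row: " `picks4'/`reps'   // ~0.80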
Here are some brief notes about the community-contributed hotdeck routines by Adrian Mander and David Clayton mentioned in the comment above by @PearlySpencer (plus a follow-up version):
There seem to be a couple of versions:
- hotdeck.ado (2007) https://ideas.repec.org/c/boc/bocode/s366901.html
- whotdeck.ado (2011) https://econpapers.repec.org/software/bocbocode/s433201.htm
As best I can tell, both of these are designed to do an Approximate Bayesian Bootstrap, which is essentially a multiple-imputation version of a hotdeck.
Unfortunately neither of them seems to handle sample (or survey) weights. The second of the two ("whotdeck") does have a parameter for weights but this appears to be for predicting "missingness" and does not have anything to do with sample/survey weights.
The first one ("hotdeck") does at least seem to do a standard hotdeck, so may be used in that way if you don't need weights. The second one ("whotdeck") probably does a simple hotdeck also, but the syntax was a little trickier and I didn't succeed in getting it to do so (which is probably a failure by me and in any event is not to knock it as it seems designed for more complex situations).
I emailed Adrian Mander and he said he doesn't use stackoverflow, but that it would be OK for me to post his email response to my question about using sample/survey weights with hotdeck or whotdeck:
Interesting problem, if the weights are frequency weights then the easiest thing to do is expand freq_weight and then use hotdeck.
It might be able to be done with a single line of code to make it work with other types of weight because currently the imputation is done by randomly ordering the rows of your dataset by generating a random number and then sorting.. with weights you would need to generate random numbers and then probably multiply the weights to the random numbers and then order them (I think this sort of thing would work but this idea has just popped into my head so would need some thinking about).
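To make the frequency-weight suggestion from the first paragraph concrete, here's a rough sketch using only core Stata commands on the sample data above. It only illustrates the expansion step; the syntax of the hotdeck command itself is not shown and should be taken from its own help file rather than assumed from here:
// rough sketch of the frequency-weight idea (integer weights only):
// duplicating each donor row "weight" times makes an unweighted draw
// from the expanded donor pool equivalent to the weighted draw
preserve
expand weight if impute == 0     // donors now appear weight times; recipients unchanged
tab id if impute == 0            // ids 1 and 4 show 4 copies each, ids 2 and 5 show 1
// an unweighted hot deck over this expanded deck (e.g. hotdeck, or the
// sort_order method above with all weights set to 1) should then reproduce
// the weighted 80/20 donor shares
restore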