Update SOLR document without adding deleted documents









I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).



I'm able to remove all deleted documents by running an optimize:




curl http://localhost:8983/solr/core_name/update?optimize=true




But this takes hours to run and requires a lot of RAM and disk space.



Is there a better way to remove deleted documents from the Solr index, or to update a document without creating a deleted one?



Thanks for your help!










solr






asked Nov 10 at 20:07









yoann

598




  • Update: by doing an update with commit=true and expungeDeletes=true on rare occasions, it does remove some of the deleted documents that were added (from 8295 without it to 97 with it), but it also significantly increases execution time (from 2min30s to 5mins). This is helpful, but I'd prefer not to add those deleted documents to begin with.
    – yoann
    Nov 10 at 20:49
















1 Answer
Lucene uses an append-only strategy, which means that when a new version of an old document is added, the old document is marked as deleted and a new one is inserted into the index. This allows Lucene to avoid rewriting the whole index file as documents are added, at the cost of old documents still being physically present in the index until a merge or an optimize happens.



When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold; in effect, you're forcing an optimize behind the scenes whenever Solr deems it necessary.
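
For reference, expungeDeletes is passed as a parameter on the commit, in the same style as the optimize request above. A minimal sketch, assuming the same placeholder core name as in the question:

curl "http://localhost:8983/solr/core_name/update?commit=true&expungeDeletes=true"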



How you can work around this depends on more specific information about your use case; in the general case, just leaving the standard settings for merge factors etc. in place should be good enough. If you're not seeing any merges, you might have disabled automatic merging (depending on your index size, hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case, make sure to enable it properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) over the merge process.
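
If you do end up tuning merging yourself, the merge policy is configured in the <indexConfig> section of solrconfig.xml. A rough sketch only; the exact element and parameter names are assumptions that depend on your Solr version (recent versions use a mergePolicyFactory element, older ones use the mergePolicy element shown in the comments below, and 7.5 changed which TieredMergePolicy knobs are available):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
    <!-- higher values make merges favour segments with many deleted docs (pre-7.5) -->
    <double name="reclaimDeletesWeight">2.0</double>
  </mergePolicyFactory>
</indexConfig>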



If you're re-indexing your complete dataset each time, another option is to index into a separate collection/core and then switch an alias over (or rename the core) when finished, before removing the old dataset.
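
As a sketch of that approach (the collection, core, and alias names here are made up): in SolrCloud you would repoint an alias at the freshly built collection with the Collections API, and in standalone mode you can swap the live core for the new one with the CoreAdmin API:

# SolrCloud: point the "products" alias at the newly indexed collection
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_new"

# Standalone: swap the live core with the freshly indexed one
curl "http://localhost:8983/solr/admin/cores?action=SWAP&core=core_name&other=core_name_new"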






answered Nov 10 at 21:31









MatsLindh

24.3k22240




  • Thanks for your answer! I looked into modifying the TieredMergePolicy, in particular the reclaimDeletesWeight parameter, but could not figure out how to do so. Do you know how this can be modified? Should it be present in the solrconfig.xml file by default, or added to it manually?
    – yoann
    Nov 11 at 15:08







  • It'll have to be added - see Customizing Merge Policies for how the section should look. The reclaimDeletesWeight setting should be changeable through a <double name="reclaimDeletesWeight">2.0</double> there. You tell it to use the TMP through <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"> ... settings ... </mergePolicy>
    – MatsLindh
    Nov 11 at 21:29










  • Thanks @MatsLindh. I updated my solrconfig.xml file, adding <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"> <int name="maxMergeAtOnce">8</int> <int name="segmentsPerTier">8</int> <double name="reclaimDeletesWeight">10.0</double> </mergePolicy>. I'll update ~500k documents and see how it looks tomorrow.
    – yoann
    Nov 11 at 23:12











  • Using reclaimDeletesWeight = 10 did not help. My update resulted in ~500k new deleted documents (~1.5% of the total index size) and a ~100GB index size increase (~30% of the index size). I'm guessing the number of deleted documents is not large enough for the merge to be triggered. How can I check whether automatic merging is disabled? When using expungeDeletes, it doesn't seem to do a full index optimize. Is that correct? Is there a way to see which index segments have the most deleted documents and optimize only those? Thanks again for your help.
    – yoann
    Nov 12 at 19:17















