Update SOLR document without adding deleted documents
I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all deleted documents by running an optimize:
curl "http://localhost:8983/solr/core_name/update?optimize=true"
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the Solr index, or to update a document without creating a deleted one?
Thanks for your help!
solr
Update: running an update with commit=True and expungeDeletes=True on rare occasions does remove some of the deleted documents that were added (from 8295 without it to 97 with it), but it also significantly increases execution time (from 2min30s to 5min). This is helpful, but I'd prefer not to add those deleted documents to begin with.
– yoann
Nov 10 at 20:49
asked Nov 10 at 20:07
yoann
1 Answer
accepted
Lucene uses an append-only strategy, which means that when a new version of an existing document is added, the old document is marked as deleted and a new one is inserted into the index. This allows Lucene to avoid rewriting the whole index file as documents are added, at the cost of old documents still being physically present in the index until a merge or an optimize happens.
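To make the append-only behavior concrete, here is a toy model in Python. It is only a sketch of the idea, not Lucene's actual data structures: the `Segment` and `ToyIndex` names are invented for illustration, and real segments hold inverted indexes rather than dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy immutable segment: stored documents are never rewritten in place."""
    docs: dict                                 # doc id -> stored fields
    deleted: set = field(default_factory=set)  # ids marked as deleted


class ToyIndex:
    """Hypothetical stand-in for an append-only index (not the Lucene API)."""

    def __init__(self):
        self.segments = []

    def add(self, doc_id, fields):
        # An update marks the old copy as deleted, then appends a new copy.
        for seg in self.segments:
            if doc_id in seg.docs and doc_id not in seg.deleted:
                seg.deleted.add(doc_id)
        self.segments.append(Segment(docs={doc_id: fields}))

    def deleted_count(self):
        return sum(len(seg.deleted) for seg in self.segments)

    def live_count(self):
        return sum(len(seg.docs) - len(seg.deleted) for seg in self.segments)

    def merge_all(self):
        # A merge copies only live documents into a new segment, which is
        # the point where the space used by deleted documents is reclaimed.
        merged = {}
        for seg in self.segments:
            for doc_id, fields in seg.docs.items():
                if doc_id not in seg.deleted:
                    merged[doc_id] = fields
        self.segments = [Segment(docs=merged)]


if __name__ == "__main__":
    index = ToyIndex()
    index.add("a", {"v": 1})
    index.add("a", {"v": 2})  # update: the first copy of "a" is now marked deleted
    index.add("b", {"v": 1})
    print(index.live_count(), index.deleted_count())
    index.merge_all()
    print(index.live_count(), index.deleted_count())
```

In this model the deleted copy keeps occupying space until `merge_all` runs, which mirrors why disk usage grows between merges.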
When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold; in effect, you're forcing an optimize behind the scenes whenever Solr deems it necessary.
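For reference, a commit with expungeDeletes can be requested through the same update endpoint used for the optimize call in the question. The sketch below only builds the URL; the `update_url` helper is a made-up name, and the host and core name are the ones from the question.

```python
from urllib.parse import urlencode


def update_url(base_url: str, core: str, *, expunge_deletes: bool = False,
               optimize: bool = False) -> str:
    """Build a Solr update URL that commits, optionally expunging deletes or optimizing."""
    params = {"commit": "true"}
    if expunge_deletes:
        params["expungeDeletes"] = "true"
    if optimize:
        params["optimize"] = "true"
    return f"{base_url.rstrip('/')}/{core}/update?{urlencode(params)}"


if __name__ == "__main__":
    # Host and core name taken from the question; adjust for your setup.
    print(update_url("http://localhost:8983/solr", "core_name", expunge_deletes=True))
```

The resulting URL can be passed to curl or `urllib.request.urlopen` against a running Solr instance.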
How you can work around this depends on more specific information about your use case. In the general case, just leaving the merge factors etc. at their standard settings should be good enough. If you're not seeing any merges, you might have disabled automatic merges (this depends on your index size, but seeing hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case, make sure to enable them properly and tweak the values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) for the merge process.
If you're re-indexing your complete dataset each time, another option is to index to a separate collection/core, switch an alias over or rename the core when finished, and then remove the old dataset.
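As a rough sketch of the switch-over step, the helpers below build the admin URLs for swapping two cores (CoreAdmin SWAP, standalone Solr) or repointing an alias (Collections API CREATEALIAS, SolrCloud). The function names and the core, collection, and alias names are hypothetical; check the parameters against the reference guide for your Solr version before relying on them.

```python
from urllib.parse import urlencode


def swap_cores_url(base_url: str, live_core: str, rebuilt_core: str) -> str:
    """CoreAdmin SWAP: exchange the names of two cores (standalone Solr)."""
    params = {"action": "SWAP", "core": live_core, "other": rebuilt_core}
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(params)}"


def create_alias_url(base_url: str, alias: str, collection: str) -> str:
    """Collections API CREATEALIAS: point an alias at a collection (SolrCloud)."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    # "core_name_rebuild", "products" and "products_v2" are placeholder names.
    print(swap_cores_url(base, "core_name", "core_name_rebuild"))
    print(create_alias_url(base, "products", "products_v2"))
```

Because the new index is built from scratch, it contains no deleted documents at the moment of the switch, and queries keep hitting the old index until then.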
Thanks for your answer! I looked into modifying the TieredMergePolicy, in particular the reclaimDeletesWeight parameter, but could not figure out how to do so. Do you know how this can be modified? Should it be present in the solrconfig.xml file by default, or added to it manually?
– yoann
Nov 11 at 15:08
It'll have to be added - see Customizing Merge Policies for how the section should look. The reclaimDeletesWeight setting should be changeable through a <double name="reclaimDeletesWeight">2.0</double> there. You tell it to use the TMP through <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"> ... settings .. </mergePolicy>
– MatsLindh
Nov 11 at 21:29
Thanks @MatsLindh. I updated my solrconfig.xml file, adding
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">8</int>
  <int name="segmentsPerTier">8</int>
  <double name="reclaimDeletesWeight">10.0</double>
</mergePolicy>
I'll update ~500k documents and see how it looks tomorrow.
– yoann
Nov 11 at 23:12
Using reclaimDeletesWeight = 10 did not help. My update resulted in ~500k new deleted documents (~1.5% of the total index size) and a ~100 GB increase in index size (~30% of the index size). I'm guessing the number of deleted documents is not large enough for a merge to be triggered. How can I check whether automatic merges are disabled? When using expungeDeletes, it doesn't seem to do a full index optimize. Is that correct? Is there a way to see which index segments have the most deleted documents and optimize only those segments? Thanks again for your help.
– yoann
Nov 12 at 19:17
answered Nov 10 at 21:31
MatsLindh