Solr server keeps going down while indexing (millions of docs) using Pysolr

I've been trying to index a large number of documents in Solr (~200 million). I use pysolr to do the indexing. However, the Solr server keeps going down during indexing (sometimes after 100 million documents have been indexed, sometimes after ~180 million; it varies).
I'm not sure why this is happening. Is it because of the open file limit, i.e., is it related to the warning I get when starting the server with bin/solr start?

* [WARN] * Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption.

I used multiprocessing while indexing with chunks of 25000 (but I also tried with bigger chunks and without multiprocessing and it still crashed). Is it because there are too many requests being sent to Solr? My Python code is below.

import concurrent.futures
import csv
import os
from glob import glob

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/collection_name', always_commit=True)


def insert_into_solr(filepath):
    """Inserts records into an empty Solr index which has already been created."""
    record_number = 0
    list_for_solr = []
    with open(filepath, "r") as file:
        # Strip NUL bytes, which csv.reader cannot parse; fields are tab-separated.
        csv_reader = csv.reader((line.replace('\0', '') for line in file), delimiter='\t', quoting=csv.QUOTE_NONE)
        for paper_id, paper_reference_id, context in csv_reader:
            # int, int, string
            record_number += 1
            solr_record = {}
            solr_record['paper_id'] = paper_id
            solr_record['reference_id'] = paper_reference_id
            solr_record['context'] = context
            list_for_solr.append(solr_record)
            # Chunks of 25000
            if record_number % 25000 == 0:
                try:
                    solr.add(list_for_solr)
                except Exception as e:
                    print(e, record_number, filepath)
                list_for_solr = []
                print(record_number)
    # Send whatever is left over after the last full chunk.
    try:
        solr.add(list_for_solr)
    except Exception as e:
        print(e, record_number, filepath)


def create_concurrent_futures():
    """Uses all the cores to do the parsing and inserting."""
    folderpath = '.../'
    refs_files = glob(os.path.join(folderpath, '*.txt'))
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(insert_into_solr, refs_files, chunksize=1)


if __name__ == '__main__':
    create_concurrent_futures()
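
The chunking logic in the function above can also be expressed as a small generator, which keeps the batching separate from the sending. This is only a sketch under stated assumptions: `batched` and `index_in_batches` are hypothetical helper names, and `send` is a stand-in for whatever actually ships a batch (in this setting, `solr.add`). Nothing in the sketch talks to Solr.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(records, send, batch_size=25000):
    """Pass records to send(batch) in batches; return (sent, failed) counts.

    send is a hypothetical callable standing in for solr.add.
    """
    sent = failed = 0
    for batch in batched(records, batch_size):
        try:
            send(batch)
            sent += len(batch)
        except Exception as exc:
            # Like the code above: report the failure and carry on.
            print(exc, sent + failed + len(batch))
            failed += len(batch)
    return sent, failed


if __name__ == "__main__":
    demo = ({"paper_id": i, "reference_id": i + 1, "context": "text"} for i in range(7))
    # A no-op sender, so the demo runs without a Solr server.
    print(index_in_batches(demo, send=lambda batch: None, batch_size=3))
```

Because the last, shorter batch comes out of the same generator, no separate leftover step is needed after the loop.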

I read somewhere that a standard Solr installation has a hard limit of around 2.14 billion documents. Is it better to use SolrCloud (which I have never configured) when there are hundreds of millions of docs? Will it help with this problem? (I also have another file with 1.4 billion documents which needs to be indexed after this.) I have only one server; is there any point in trying to configure SolrCloud?

asked Nov 15 '18 at 19:34

ash

An easy test is to change the ulimit and see if it helps; see "File handles and processes - ulimit settings" for information. Using SolrCloud is helpful when you want to spread the set of documents across multiple servers. The 2.1b limit is per shard / core, so using a collection in SolrCloud with multiple servers (even if they're running on a single machine but with different working directories) will allow you to scale that up further.

– MatsLindh
Nov 15 '18 at 20:20
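
Before changing anything, the current limit can be read from Python. This is a minimal sketch using the standard-library `resource` module (Unix only). It reports the limits of the Python process itself, which match Solr's only if Solr is started by the same user from a comparable shell environment; that is an assumption to verify. `open_file_limits` and `is_sufficient` are hypothetical helper names, and 65000 is the figure from the startup warning.

```python
import resource  # Unix-only standard-library module

RECOMMENDED_OPEN_FILES = 65000  # figure taken from the Solr startup warning


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_sufficient(limit, recommended=RECOMMENDED_OPEN_FILES):
    """True if limit is unlimited or at least the recommended value."""
    return limit == resource.RLIM_INFINITY or limit >= recommended


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft:", soft, "hard:", hard)
    print("soft limit sufficient:", is_sufficient(soft))
```

If the soft limit is below the recommended value but the hard limit is not, the soft limit can usually be raised without administrator rights, for example with `ulimit -n` in the shell that starts Solr.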

Thanks @MatsLindh. I wanted to find out before asking the sysadmin to increase the ulimit. I set up Solr using the method in 'Getting started' (i.e., I just extracted it) rather than following the process in the 'Take Solr to production' page. I suppose using default configs might also be contributing to this issue? So from what I understand, it's probably best to configure and use SolrCloud when there are so many documents, isn't it?

– ash
Nov 15 '18 at 21:10

That depends. It can be, if the number of queries or the total number of documents requires it. It's also easier to scale in the future if necessary, but for prototyping or hosting something for data exploration with a small number of users, it probably isn't.

– MatsLindh
Nov 16 '18 at 9:14
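
To put the per-shard limit mentioned in the earlier comment into numbers, here is some rough arithmetic. It assumes the ceiling is about 2^31 - 1 documents per core (the "2.1b" figure; the exact value in a given Lucene version may be slightly lower) and that both files from the question (200 million and 1.4 billion documents) would go into the same collection. `min_shards` and the 500 million per-shard target are hypothetical, chosen only for illustration.

```python
PER_SHARD_LIMIT = 2**31 - 1  # assumed ceiling, roughly 2.1 billion docs per core


def min_shards(total_docs, per_shard=PER_SHARD_LIMIT):
    """Smallest shard count that keeps every shard at or under per_shard docs."""
    if total_docs < 0 or per_shard <= 0:
        raise ValueError("total_docs must be >= 0 and per_shard must be > 0")
    # Integer ceiling division, avoiding floating-point rounding.
    return max(1, -(-total_docs // per_shard))


if __name__ == "__main__":
    total = 200_000_000 + 1_400_000_000  # document counts from the question
    print(min_shards(total))                         # against the hard ceiling
    print(min_shards(total, per_shard=500_000_000))  # against an illustrative target
```

By document count alone, the combined 1.6 billion documents stay under the assumed ceiling of a single core. Whether a single core is practical at that size depends on memory, disk, and query load, which is the trade-off the comment describes.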

Thanks @MatsLindh for the advice. That's very helpful.

– ash
Nov 17 '18 at 2:40

python ubuntu unix solr pysolr
