Text clustering/NLP [closed]

Imagine there is a column in a dataset that holds university names. We need to group the values so that the number of groups is as close as possible to the real number of universities. The problem is that the same university may be named in different ways. An example: University of Stanford = Stanford University = Uni of Stanford. Is there an NLP method, function, or library for this in Python 3?

Let's consider both cases: the data might be tagged or untagged.

Thanks in advance.

edited Nov 17 '18 at 21:15

asked Nov 15 '18 at 7:22

BC1554

closed as too broad by cricket_007, usr2564301, Owen Pauling, jcubic, Janusz Nov 15 '18 at 14:16

Speaking of Stanford, have you heard of the CoreNLP library? Tried it?

– cricket_007
Nov 15 '18 at 7:36

I haven't heard of it, but I will try it. Thanks for sharing.

– BC1554
Nov 17 '18 at 17:16

Hi, I am on Elasticsearch, not Python, so it is kind of different. I have had to spend a lot of time looking for a solution. Please see my problem in the comment below.

– BC1554
Dec 6 '18 at 6:04

Sure, your data is stored there. That doesn't mean you can't query for something, then use Python libraries to process the data, then insert the results back into Elastic.

– cricket_007
Dec 6 '18 at 14:07

@cricket_007 that's an option. I'm curious how that would behave if I have 5 million unique names from Elastic and want to match them against an original list of 5 thousand university names. My first idea is to try matching using n-grams (probably comparing trigrams vs. bigrams, and stemmed vs. not stemmed). What are your thoughts?

– BC1554
Dec 6 '18 at 14:32
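
To make the n-gram idea in the comment above concrete, here is a minimal standard-library sketch, assuming character trigrams and Jaccard similarity (neither is confirmed by the thread, and stemming is left out). The function names, the three-entry reference list, and the 0.3 threshold are hypothetical. The inverted index means each incoming name is only compared with reference names that share at least one trigram, which matters when millions of names are matched against thousands.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {' '.join(text.lower().split())} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(reference_names, n=3):
    """Build an inverted index: n-gram -> indices of reference names containing it."""
    index = defaultdict(set)
    grams = []
    for i, name in enumerate(reference_names):
        name_grams = char_ngrams(name, n)
        grams.append(name_grams)
        for gram in name_grams:
            index[gram].add(i)
    return index, grams


def best_match(query, reference_names, index, grams, n=3, threshold=0.3):
    """Return (best reference name or None, Jaccard score) for one query string."""
    query_grams = char_ngrams(query, n)
    # Only score reference names that share at least one n-gram with the query.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    best, best_score = None, 0.0
    for i in candidates:
        score = len(query_grams & grams[i]) / len(query_grams | grams[i])
        if score > best_score:
            best, best_score = reference_names[i], score
    if best_score < threshold:
        return None, best_score
    return best, best_score


# Hypothetical reference list standing in for the 5 thousand canonical names.
reference = ["Stanford University", "Harvard University", "University of Oxford"]
index, grams = build_index(reference)
for raw in ["Uni of Stanford", "Stanford Univ", "MIT"]:
    print(raw, "->", best_match(raw, reference, index, grams))
```

Names that fall below the threshold come back as None, so they can be reviewed by hand instead of being forced into the wrong group.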

python machine-learning nlp text-classification

1 Answer

A very simple unsupervised approach would be to use k-means clustering. The advantage here is that you know exactly how many clusters (k) to expect, since you know the number of universities in advance.

Then you could use a package such as scikit-learn to create your feature vectors (most likely character n-grams, using CountVectorizer with the option analyzer="char") and run the clustering on those vectors to group together similarly written university names.

There is no guarantee that the groups will match perfectly, but I think that it should work quite well, as long as the different spellings are somewhat similar.
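
A minimal sketch of the approach described above, assuming scikit-learn is installed. This is not the answerer's own code: the helper name cluster_names, the n-gram range (2, 3), the sample names, and the seed are all assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer


def cluster_names(names, k, ngram_range=(2, 3), seed=0):
    """Cluster raw name strings into k groups using character n-gram counts."""
    # analyzer="char" builds features from character n-grams instead of words.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=ngram_range, lowercase=True)
    features = vectorizer.fit_transform(names)
    # k is the known number of universities; n_init and random_state are fixed
    # so repeated runs give the same assignment.
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return model.fit_predict(features)


# Hypothetical sample: two universities written in several ways.
names = [
    "Stanford University",
    "stanford university",
    "University of Stanford",
    "Uni of Stanford",
    "Harvard University",
    "Harvard Univ",
    "Harvard",
]
labels = cluster_names(names, k=2)
for name, label in zip(names, labels):
    print(label, name)
```

Shared words such as "University" contribute many n-grams of their own, so it is worth inspecting the printed groups. Stripping such common words before vectorizing, or switching to TF-IDF weighting, may help if the clusters split on them rather than on the university name.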

answered Nov 15 '18 at 7:44

Ivo Merchiers

@BC1554 Any update on whether this approach was useful?

– Ivo Merchiers
Nov 30 '18 at 8:55

Hi, I thought I would be using Python in my new job. However, I am using Elasticsearch, so it is kind of different. I am trying to implement match_phrase + fuzziness, but it seems to be impossible (there are not many examples online, and no such case is described in the documentation). Does anybody have experience with phrase matching including fuzziness in Elasticsearch? Thanks in advance :)

– BC1554
Dec 6 '18 at 6:03
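
On the question above: match_phrase does not take a fuzziness option, and one commonly used workaround is a span_near query whose clauses are fuzzy queries wrapped in span_multi. The sketch below only builds the request body as a Python dict. The field name university_name is hypothetical, and lowercasing and whitespace splitting are an assumption standing in for whatever analyzer the field really uses, since span queries work on already-analyzed terms.

```python
import json


def fuzzy_phrase_query(field, phrase, fuzziness="AUTO", slop=0, in_order=True):
    """Build a span_near request body that matches a phrase with per-term fuzziness."""
    clauses = [
        # span_multi wraps a multi-term query (here: fuzzy) so it can be used as a span clause.
        {"span_multi": {"match": {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}}}}
        # Assumption: the field's analyzer lowercases and splits on whitespace.
        for term in phrase.lower().split()
    ]
    # slop=0 with in_order=True requires adjacent terms in the given order, like a phrase.
    return {"query": {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}}


body = fuzzy_phrase_query("university_name", "Stanford Univrsity")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of a search request. Raising slop allows extra words between the terms, for example "Stanford Junior University".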
