Building a classification model










I have student data with 1500 rows and 30 columns, and I have used GradientBoostingClassifier. All the data is categorical, with columns having anywhere from 0 to 80 or 0 to 90 categories.


I need to build a prediction model to predict whether a student will pass or fail. The column 'Status' is my target variable. Below is the code I have used:


from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Label-encode every column (all features are categorical)
le = preprocessing.LabelEncoder()
data = data.apply(le.fit_transform)

X = data.copy()
y = data['Status']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the gradient boosting classifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = gb.predict(X_test)


Below is my model's performance:



Accuracy: 0.9719317419707395
Precision: 0.9090272918124562
Recall: 0.5650282731622445


Please let me know what I should do to improve the model and how to handle the wide range of categorical data. Also, when I test the model on a different dataset, the categories often change, either due to spelling mistakes or new additions.



std_id  std_name  Dem  secn_id  location  bucket  Primary_subject  status
144     amy       SEP  5.3      P         dev     english          pass
230     mani      SEV  11.3     E         Tech    math             fail
299     sam       DE   5.1      nap       prac    science          pass
568     samy      SEP  1.1      P         prac    V1               pass
769     elle      SEP  1.2      pe        prac    english          pass
761     tanj      SEP  1.3      N         tech    V2               pass
112     jon       ERM  3.0      N         prac    phy              fail
116     pal       NAN  9.1      sc        etc     V1V2             pass
116     pal       NAN  9.2      sc        etc     V1V3             fail
113     josh      NAN  9.3      du        etc.    erp              fail
100     sug       EVV  9.1      sc        NAN     che              pass
323     adi       ERP  3.1      NAN       fit     math             fail
323     adi       ERP  3.2      NAN       fit     math             fail


This is how my input data looks. For missing values I have replaced them with the string "NAN". There are duplicate records for a student if they have changed any option.

python machine-learning scikit-learn

asked Nov 13 '18 at 6:36, edited Nov 13 '18 at 8:16 – aim

  • accuracy 0.97 and you need to improve?
    – Dejan Marić
    Nov 13 '18 at 7:00










  • Can you add the precision and recall for both classes?
    – AI_Learning
    Nov 13 '18 at 8:19










  • Though the accuracy is 0.97, when I use a new set of data, transform it, and fit it to this model, it gives very wrong predictions.
    – aim
    Nov 13 '18 at 8:21















2 Answers






Your model performance is decent. To improve it further:



  1. Tune the parameters of GradientBoostingClassifier. You can set values for parameters like n_estimators, learning_rate, etc. and check the performance of your model. For this task I suggest GridSearchCV (see the sketch after this list).


  2. Feature engineering: you can create new features from the existing ones. As you have not provided the data, it is hard to suggest anything specific. You can check feature importance using a Random Forest, etc. and keep the features with high importance.


  3. You can try different algorithms like XGBoost, LightGBM, or even a neural network.


  4. You can use a cross-validator like StratifiedShuffleSplit.
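
A minimal sketch of points 1 and 4, reusing X_train and y_train from the question; the grid values below are only illustrative, not tuned recommendations:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Illustrative grid; widen or shrink it depending on how long the search takes
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [2, 3, 4],
}

# StratifiedShuffleSplit keeps the pass/fail ratio roughly equal in every split
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring='recall',  # recall is the weakest metric reported in the question
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)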


Regarding your next problem.



Again, it is hard to suggest anything without looking at the data.
To avoid spelling mistakes you can force users to select values from a dropdown, etc.
If that is not possible, you can look at the difflib library, which will find the closest match for a category (a short sketch follows).
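
For example, a minimal difflib.get_close_matches sketch; the list of known 'location' categories below is an assumption based on the sample data:

import difflib

# Hypothetical list of categories seen during training
known_locations = ['P', 'E', 'nap', 'pe', 'N', 'sc', 'du', 'NAN']

def normalise(value, known, cutoff=0.6):
    """Map a possibly misspelled value to the closest known category."""
    matches = difflib.get_close_matches(str(value), known, n=1, cutoff=cutoff)
    return matches[0] if matches else 'NAN'  # fall back to the missing-value marker

print(normalise('nap ', known_locations))  # -> 'nap'
print(normalise('scc', known_locations))   # -> 'sc'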






answered Nov 13 '18 at 7:01 by Sociopath

  • Thanks Akshay! I have added my data snippet.
    – aim
    Nov 13 '18 at 8:17










  • I am trying to test this model on a different dataset, but while label encoding the category values I see a discrepancy: after encoding data, the value P in column 'location' was converted to 1, but when I did the label encoding on data_new, P in 'location' was converted to 3. Please let me know whether this will impact the performance of the model and how to fix this issue.
    – aim
    Nov 14 '18 at 8:23










  • You can predefine your categories and use them, so the value of P will always be 1. See the 2nd part of this answer (a rough sketch of this idea follows after these comments).
    – Sociopath
    Nov 14 '18 at 9:14










  • I tried the above-mentioned link, modified it a bit, and created my own encoding dict using names = np.unique(df.values); ran = len(np.unique(df.values)); my_enc = dict(zip(names, range(ran+1))); df.replace(my_enc, inplace=True). Using this I have fixed the discrepancy, but the number of unique values is 115, so I have numbers from 0 to 115. I wonder if this will cause an issue, as the model may assume an order like 115 > 93 > 45 and so on. Any suggestions on this?
    – aim
    Nov 14 '18 at 11:48
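
A rough sketch of the "predefined categories" idea from the comments above; the column name and category list are assumptions based on the sample data, and the same mapping is reused for any new dataset so the codes never shift:

import pandas as pd

# Fix the category list per column once (taken from the training data),
# so 'P' always gets the same code no matter which dataset is encoded
location_categories = ['E', 'N', 'NAN', 'P', 'du', 'nap', 'pe', 'sc']

def encode_location(series):
    # Values not in the list (new or misspelled categories) become -1
    cat = pd.Categorical(series, categories=location_categories)
    return pd.Series(cat.codes, index=series.index)

data['location'] = encode_location(data['location'])
# data_new['location'] = encode_location(data_new['location'])  # same codes on new data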


















First of all, I would suggest starting with some data cleaning and data analysis.
The fact that your categories change due to mistakes needs to be fixed in a preprocessing step. There are not many shortcuts here; you need to inspect and fix the data manually.



Also check for missing values. If there are missing values, you need to address this issue as well. You can remove the affected samples (accepting the loss of information) or substitute the missing values with the average value of the specific feature. Other methods exist in the literature, but as a first step those two should do (a small sketch follows).
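
A minimal sketch of the substitution option; since the features in the question are categorical, this uses the most frequent value per column rather than the mean, and it assumes the raw DataFrame where missing values were stored as the string "NAN":

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Treat the literal string "NAN" as a real missing value first
data = data.replace('NAN', np.nan)

# The mean is undefined for categorical columns, so impute the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)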



Please also check the number of samples you have in each class. If the two classes are strongly imbalanced, you could look for solutions that address "imbalanced data" (see the quick check below).
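
A quick way to run that check, assuming the 'Status' target column from the question:

# Class distribution of the target; a strong skew suggests using class weights,
# resampling, or evaluation metrics other than accuracy
print(data['Status'].value_counts(normalize=True))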



Classifiers such as decision trees or random forests are a good option when handling categorical variables (a sketch follows).
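
A minimal sketch of that suggestion, assuming the raw string-valued features and the column names from the question and sample table; one-hot encoding avoids imposing an artificial order on the categories, and handle_unknown='ignore' lets the model cope with categories that only appear in new data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

model = Pipeline([
    # Unknown categories at prediction time are encoded as all zeros
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    # class_weight='balanced' helps if the pass/fail counts are skewed
    ('rf', RandomForestClassifier(n_estimators=300, class_weight='balanced', random_state=42)),
])

X = data.drop(columns=['Status', 'std_name'])  # drop the target and the student names
y = data['Status']
model.fit(X, y)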



Using cross-validation to tune the hyper-parameters of the classifier could also improve the performance.



EDIT (after the data table was added)



You probably do not want to use the students' names, since that feature is not related to passing or failing the exam.






answered Nov 13 '18 at 7:04 by Roberto, edited Nov 13 '18 at 9:21

  • great. You have covered the topics that I missed in my answer :-)
    – Sociopath
    Nov 13 '18 at 7:05









