Building a classification model
I have student data with 1,500 rows and 30 columns, and I have used GradientBoostingClassifier. All of the features are categorical, with anywhere from 0 to 80 or 90 distinct categories per column.
I need to build a prediction model that predicts whether a student will pass or fail; the column 'Status' is my target variable. Below is the code I have used:
from sklearn import preprocessing

# Label-encode every column (apply fits a separate integer mapping per column)
le = preprocessing.LabelEncoder()
data = data.apply(le.fit_transform)

# Features and target ('Status' is dropped from X so the target does not leak into the features)
X = data.drop(columns=['Status'])
y = data['Status']

# Split dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Gradient Boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = gb.predict(X_test)
Below is my model's performance on the test set:
Accuracy: 0.9719317419707395
Precision: 0.9090272918124562
Recall: 0.5650282731622445
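For reference, these averaged numbers can hide how each class behaves; a minimal sketch, reusing y_test and y_pred from the code above, that also prints per-class precision and recall (nothing here changes the model):

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))        # rows: true classes, columns: predicted classes
print(classification_report(y_test, y_pred))   # precision, recall and F1 for each class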
Please let me know what I should do to improve the model and how to handle categorical features with so many levels. Also, when I test the model on a different dataset, the categories often change, either because of spelling mistakes or because new categories have been added.
std_id  std_name  Dem  secn_id  location  bucket  Primary_subject  status
144     amy       SEP  5.3      P         dev     english          pass
230     mani      SEV  11.3     E         Tech    math             fail
299     sam       DE   5.1      nap       prac    science          pass
568     samy      SEP  1.1      P         prac    V1               pass
769     elle      SEP  1.2      pe        prac    english          pass
761     tanj      SEP  1.3      N         tech    V2               pass
112     jon       ERM  3.0      N         prac    phy              fail
116     pal       NAN  9.1      sc        etc     V1V2             pass
116     pal       NAN  9.2      sc        etc     V1V3             fail
113     josh      NAN  9.3      du        etc.    erp              fail
100     sug       EVV  9.1      sc        NAN     che              pass
323     adi       ERP  3.1      NAN       fit     math             fail
323     adi       ERP  3.2      NAN       fit     math             fail
This is how my input data looks. For missing values I have replaced them with the string "NAN". There are duplicate records for a student if they have changed any option.
python machine-learning scikit-learn
Accuracy 0.97 and you need to improve? – Dejan Marić, Nov 13 '18 at 7:00
Can you add the precision and recall for both classes? – AI_Learning, Nov 13 '18 at 8:19
Though the accuracy is 0.97, when I use a new set of data, transform it, and run it through this model, it gives very wrong predictions. – aim, Nov 13 '18 at 8:21
2 Answers
Your model performance is decent. To improve it further:
- Tune the parameters of GradientBoostingClassifier. You can set values for parameters like n_estimators, learning_rate, etc. and check the performance of your model. For this task I'd suggest GridSearchCV (see the sketch after this list).
- Feature engineering: you can create new features from the existing ones. As you have not provided data, it's hard to suggest anything specific. You can check feature importance using, e.g., a Random Forest and keep the features with high importance.
- You can try different algorithms such as XGBoost, LightGBM, or even a neural network.
- You can use a cross-validator such as StratifiedShuffleSplit.
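A minimal sketch of that tuning step, combining GridSearchCV with StratifiedShuffleSplit as the cross-validator; the grid values are illustrative, not tuned recommendations:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [2, 3, 4],
}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=cv, scoring='f1_macro', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)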
Regarding your second problem: again, it's hard to suggest anything without looking at the data. To avoid spelling mistakes you can force users to select values from a dropdown, etc. If that's not an option, look at the difflib library, which can find the closest match for a category.
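A small sketch of the difflib idea, mapping a possibly misspelled value onto the closest known category; the category list here is made up from the sample table, and the fallback to "NAN" follows the question's missing-value convention:

import difflib

# Hypothetical list of categories seen at training time (taken from the sample table)
known_buckets = ['dev', 'Tech', 'prac', 'tech', 'etc', 'fit', 'NAN']

def normalise(value, known, cutoff=0.6):
    # Return the closest known category, or 'NAN' when nothing is close enough
    match = difflib.get_close_matches(str(value), known, n=1, cutoff=cutoff)
    return match[0] if match else 'NAN'

print(normalise('prc', known_buckets))       # -> 'prac'
print(normalise('unknown', known_buckets))   # -> 'NAN' (treated as missing)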
Thanks Akshay..!! I have added my data snippet. – aim, Nov 13 '18 at 8:17
I am trying to test this model on a different dataset, but while label-encoding the category values I see a discrepancy: after encoding data, the 'location' value P was converted to 1, but when I label-encoded data_new, the 'location' value P was converted to 3. Will this impact the performance of the model, and how do I fix this issue? – aim, Nov 14 '18 at 8:23
You can predefine your categories and use them, so the value of P will always be 1. See the second part of this answer. – Sociopath, Nov 14 '18 at 9:14
I tried the above-mentioned link and modified it a bit to create my own encoding dict: names = np.unique(df.values); ran = len(np.unique(df.values)); my_enc = dict(zip(names, range(ran + 1))); df.replace(my_enc, inplace=True). This fixed the discrepancy, but I have 115 unique values, so I end up with numbers from 0 to 115. I wonder if this will cause an issue, as the model may assume a hierarchy like 115 > 93 > 45 and so on. Any suggestions on this? – aim, Nov 14 '18 at 11:48
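Following up on this comment thread: one way to keep the mapping stable across datasets and avoid the artificial 115 > 93 > 45 ordering is to fit a one-hot encoder once on the training data and reuse that fitted encoder for any new data. The sketch below is one possible setup, not the asker's code: the column names come from the sample table, X_raw and X_new_raw are hypothetical dataframes holding the un-encoded string values, and handle_unknown='ignore' makes unseen or misspelled categories encode as all zeros. OneHotEncoder also accepts an explicit categories= list if you prefer to pin the categories by hand.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Columns taken from the sample table; identifier columns are deliberately left out
categorical_cols = ['Dem', 'secn_id', 'location', 'bucket', 'Primary_subject']

pipe = Pipeline([
    ('encode', ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
        remainder='drop')),
    ('model', GradientBoostingClassifier(random_state=42)),
])

pipe.fit(X_raw, y)                    # fit the encoding and the model once, on training data
y_new_pred = pipe.predict(X_new_raw)  # reuse the same fitted encoding for a new dataset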
First of all, I would suggest starting with some data cleaning and data analysis.
The fact that your categories change due to mistakes needs to be fixed in a preprocessing step. There are not many shortcuts here: you need to inspect and fix the data manually.
Also check for missing values. If there are any, you need to address that issue as well: you can remove the affected samples (accepting the loss of information) or substitute the missing value with the average value for that feature. Other methods exist in the literature, but as a first step those two will do.
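If you go the substitution route, scikit-learn's SimpleImputer can do this in one step. Since the sample data is categorical, the sketch below uses the most frequent value rather than the mean, and it assumes raw_df is a hypothetical un-encoded dataframe in which missing values were stored as the string "NAN", as in the question:

import numpy as np
from sklearn.impute import SimpleImputer

raw_df = raw_df.replace('NAN', np.nan)              # turn the "NAN" placeholder back into real NaN
imputer = SimpleImputer(strategy='most_frequent')   # per-column mode; 'mean' only suits numeric data
raw_df[raw_df.columns] = imputer.fit_transform(raw_df)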
Also check the number of samples you have in each class. If the two classes are strongly unbalanced, consider looking into techniques that address "unbalanced data".
Classifiers such as decision trees / random forests are a good option when handling categorical variables.
Using cross-validation to tune the hyper-parameters of the classifier could also improve performance.
EDIT (after the data table was added):
You probably don't want to use the students' names, since that feature is not related to passing or failing the exam.
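Two of these checks can be illustrated briefly. The sketch below reuses the training split from the question, drops the identifier-like columns named in the sample table, and uses sample weights as one possible (not the only) way to compensate for an imbalanced target:

from sklearn.utils.class_weight import compute_sample_weight

print(y_train.value_counts(normalize=True))    # how imbalanced are pass and fail?

# Identifier columns such as student id/name carry no signal about pass/fail
X_train = X_train.drop(columns=['std_id', 'std_name'], errors='ignore')
X_test = X_test.drop(columns=['std_id', 'std_name'], errors='ignore')

# Up-weight the minority class when fitting the booster
weights = compute_sample_weight(class_weight='balanced', y=y_train)
gb.fit(X_train, y_train, sample_weight=weights)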
Great, you have covered the topics that I missed in my answer :-) – Sociopath, Nov 13 '18 at 7:05