RF model loses accuracy when I remove it from Pipeline









Hoping I'm overlooking something stupid here, or maybe I don't understand how this works...



I have an NLP pipeline that does basically the following:



rf_pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ('fit', RandomForestClassifier())
])


I run it:



clf = rf_pipeline.fit(X_train, y_train)
preds = clf.predict(X_test)


When I optimize, I get accuracy in the high 90s, evaluated with the following:



confusion_matrix(y_test, preds)
accuracy_score(y_test, preds)
precision_score(y_test, preds)


The TfidfVectorizer is the bottleneck in my computations, so I wanted to break up the pipeline: run the vectorizer once, then do a grid search on the classifier alone rather than re-running the whole pipeline. Here's how I broke it out:



# initialize
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)
rf_class = RandomForestClassifier()
# transform and fit
vect = tfidf.fit_transform(X_train)
clf = rf_class.fit(vect, y_train)
# predict
clf.predict(tfidf.fit_transform(X_test))


When I checked the accuracy before running a full grid search, it had plummeted to just over 50%. When I tried increasing the number of trees, the score dropped almost 10%.



Any ideas?










  • Could you make your example reproducible by using one of scikit-learn's included datasets? scikit-learn.org/stable/tutorial/text_analytics/…
    – hellpanderr
    Nov 11 at 11:13














scikit-learn nlp random-forest spacy tfidfvectorizer






asked Nov 10 at 18:33 by Oct











1 Answer
For the test set you can't call fit_transform(), only transform(); otherwise the elements of the tf-idf vectors have a different meaning. fit_transform refits the vocabulary and IDF weights on the test data, so the columns of the test matrix no longer correspond to the features the classifier was trained on.

Try this:

# predict
clf.predict(tfidf.transform(X_test))
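To see why this matters, here is a minimal sketch with a toy corpus (default tokenizer instead of spacy_tokenizer, so it runs without spaCy): refitting on the test set learns a different vocabulary, so the feature columns no longer line up with what the classifier was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["red apple", "green apple", "red car"]
test = ["green car"]

# Fit the vectorizer on the training data only.
tfidf = TfidfVectorizer()
tfidf.fit(train)
print(sorted(tfidf.vocabulary_))  # ['apple', 'car', 'green', 'red']

# Wrong: refitting on the test set learns a new, smaller vocabulary,
# so column j no longer means the same word the classifier saw.
wrong = TfidfVectorizer().fit_transform(test)
print(wrong.shape)  # (1, 2)

# Right: reuse the vectorizer fitted on the training data.
right = tfidf.transform(test)
print(right.shape)  # (1, 4) -- aligned with the training features
```

With fit_transform the test matrix doesn't even have the same number of columns, and even when the counts happen to match, the column-to-word mapping differs, which is why the random forest's accuracy collapses.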





answered Nov 12 at 5:04 by Tomáš Přinda
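If the goal is to avoid re-running the expensive vectorizer during tuning, a common pattern is to fit it once on the training data and grid-search only the classifier on the cached matrix. A minimal sketch with a hypothetical toy corpus (the real data, spacy_tokenizer, and parameter grid would differ; note that fitting tf-idf on the full training set before cross-validation does leak IDF statistics across folds slightly):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real corpus (hypothetical data).
X_train = ["good movie", "great film", "bad movie", "awful film"] * 5
y_train = [1, 1, 0, 0] * 5
X_test = ["great movie", "awful movie"]

# Fit the expensive vectorizer once...
tfidf = TfidfVectorizer()
X_train_vect = tfidf.fit_transform(X_train)

# ...then grid-search only the classifier over the cached features.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=2,
)
grid.fit(X_train_vect, y_train)

# At predict time, reuse the fitted vocabulary with transform(),
# never fit_transform().
preds = grid.predict(tfidf.transform(X_test))
print(grid.best_params_, preds)
```

This gives the speedup the question is after while keeping the train-time and predict-time feature spaces identical.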



























             
