RF model loses accuracy when I remove it from Pipeline









Hoping I'm overlooking something stupid here, or maybe I don't understand how this works...



I have an NLP pipeline that does basically the following:



rf_pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ('fit', RandomForestClassifier())
])


I run it:



clf = rf_pipeline.fit(X_train, y_train)
preds = clf.predict(X_test)


When I optimize, I get accuracy in the high 90s, evaluated with the following:



confusion_matrix(y_test, preds)
accuracy_score(y_test, preds)
precision_score(y_test, preds)


The TfidfVectorizer is the bottleneck in my computations, so I wanted to break up the pipeline: run the vectorizer once, then do a grid search on the classifier alone rather than re-running the whole pipeline. Here's how I broke it out:



# initialize
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)
rf_class = RandomForestClassifier()
# transform and fit
vect = tfidf.fit_transform(X_train)
clf = rf_class.fit(vect, y_train)
# predict
clf.predict(tfidf.fit_transform(X_test))


When I checked the accuracy before running a full grid search, it had plummeted to just over 50%. When I tried increasing the number of trees, the score dropped almost 10%.



Any ideas?










  • Could you make your example reproducible by using one of scikit-learn's included datasets? scikit-learn.org/stable/tutorial/text_analytics/…
    – hellpanderr
    Nov 11 at 11:13














scikit-learn nlp random-forest spacy tfidfvectorizer






asked Nov 10 at 18:33 by Oct











1 Answer
For the test set you can't call fit_transform(), only transform(); otherwise the elements of the tf-idf vectors have a different meaning. fit_transform refits the vocabulary and IDF weights on the test data, so the columns of the test matrix no longer correspond to the features the classifier was trained on.

Try this:

# predict
clf.predict(tfidf.transform(X_test))
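To see why this matters, here is a minimal sketch with a toy corpus (default tokenizer instead of spacy_tokenizer, so it runs without spaCy): refitting on the test set learns a different vocabulary, so the feature columns no longer line up with what the classifier was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["red apple", "green apple", "red car"]
test = ["green car"]

# Fit the vectorizer on the training data only.
tfidf = TfidfVectorizer()
tfidf.fit(train)
print(sorted(tfidf.vocabulary_))  # ['apple', 'car', 'green', 'red']

# Wrong: refitting on the test set learns a new, smaller vocabulary,
# so column j no longer means the same word the classifier saw.
wrong = TfidfVectorizer().fit_transform(test)
print(wrong.shape)  # (1, 2)

# Right: reuse the vectorizer fitted on the training data.
right = tfidf.transform(test)
print(right.shape)  # (1, 4) -- aligned with the training features
```

With fit_transform the test matrix doesn't even have the same number of columns, and even when the counts happen to match, the column-to-word mapping differs, which is why the random forest's accuracy collapses.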





answered Nov 12 at 5:04 by Tomáš Přinda
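If the goal is to avoid re-running the expensive vectorizer during tuning, a common pattern is to fit it once on the training data and grid-search only the classifier on the cached matrix. A minimal sketch with a hypothetical toy corpus (the real data, spacy_tokenizer, and parameter grid would differ; note that fitting tf-idf on the full training set before cross-validation does leak IDF statistics across folds slightly):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real corpus (hypothetical data).
X_train = ["good movie", "great film", "bad movie", "awful film"] * 5
y_train = [1, 1, 0, 0] * 5
X_test = ["great movie", "awful movie"]

# Fit the expensive vectorizer once...
tfidf = TfidfVectorizer()
X_train_vect = tfidf.fit_transform(X_train)

# ...then grid-search only the classifier over the cached features.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=2,
)
grid.fit(X_train_vect, y_train)

# At predict time, reuse the fitted vocabulary with transform(),
# never fit_transform().
preds = grid.predict(tfidf.transform(X_test))
print(grid.best_params_, preds)
```

This gives the speedup the question is after while keeping the train-time and predict-time feature spaces identical.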



























             
