Use one attribute only once in scikit-learn decision tree in python





















I am using the scikit-learn Python module to create a decision tree, and it is working like a charm. I would like to achieve one more thing: to make the tree split on each attribute only once.



The reason behind this is my rather unusual dataset. The data is noisy, and I am really interested in the noise as well. My class outcomes are binary, say [+, -], and I have a number of attributes with values mostly in the range (0, 1).



When scikit-learn builds the tree, it splits on the same attribute multiple times to make the tree "better". I understand that this makes the leaf nodes purer, but that is not what I want to achieve.



What I did instead was to define a cutoff for every attribute by computing the information gain at different cutoffs and choosing the maximum. With this approach, using "leave-one-out" and "1/3-2/3" cross-validation, I get better results than with the original tree.



The problem is that when I try to automate this, I run into trouble near the lower and upper bounds, i.e. around 0 and 1: most of the elements fall below/above such a cutoff, so I get a really high information gain because one of the sets is pure, even if it contains only 1-2% of the full data.
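Roughly, this is the kind of cutoff search I mean (a minimal sketch, assuming binary labels coded as 0/1; the min_fraction guard against tiny pure splits is just the workaround I am experimenting with, not something from scikit-learn):

    import numpy as np

    def entropy(y):
        # Shannon entropy of a 0/1 label vector
        if len(y) == 0:
            return 0.0
        p = np.bincount(y, minlength=2) / len(y)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def best_cutoff(x, y, min_fraction=0.05):
        # Pick the cutoff on one attribute that maximises information gain,
        # skipping splits that leave less than min_fraction of the samples
        # on either side (to avoid the "pure but tiny" sets near 0 and 1).
        order = np.argsort(x)
        x, y = x[order], y[order]
        n = len(y)
        parent = entropy(y)
        best_gain, best_cut = 0.0, None
        for i in range(1, n):  # candidate cutoffs: midpoints of consecutive values
            if x[i] == x[i - 1]:
                continue
            left, right = y[:i], y[i:]
            if min(len(left), len(right)) < min_fraction * n:
                continue
            gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if gain > best_gain:
                best_gain, best_cut = gain, (x[i - 1] + x[i]) / 2
        return best_cut, best_gain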



All in all, I would like to find a way to make scikit-learn split on each attribute only once.



If it cannot be done, do you have any advice on how to generate those cutoffs in a nicer way?



Thanks a lot!



























  • I wonder if you can achieve what you want by tuning min_samples_leaf, min_samples_split or min_weight_fraction_leaf.
    – maxymoo
    Nov 15 '16 at 3:50

  • Is it indeed possible as the OP describes?
    – Dror
    Jun 1 '17 at 8:11

  • I haven't found a way with scikit-learn; I wrote my own methods.
    – Gábor Erdős
    Jun 3 '17 at 8:41

  • Do you really want to use each attribute only once? What's wrong if the same feature x_i is used both in the left and right subtrees? If you want to avoid overfitting, you could use gradient boosting with very small trees instead of a single tree.
    – David Dale
    Nov 2 '17 at 16:11

  • Have you tried just using logistic regression? By definition it will use every attribute only once.
    – Josep Valls
    Jun 14 at 23:36
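For illustration, a minimal sketch of the two alternatives raised in the comments above (depth-1 boosted stumps and logistic regression), using a synthetic stand-in dataset since the original data is not shown:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # synthetic stand-in for the noisy binary-class data described above
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Gradient boosting with depth-1 stumps: each individual tree splits on
    # exactly one attribute, so no single tree reuses a feature.
    stumps = GradientBoostingClassifier(n_estimators=100, max_depth=1, random_state=0)
    print(cross_val_score(stumps, X, y, cv=5).mean())

    # Logistic regression: one coefficient per attribute, i.e. every
    # attribute is "used" exactly once.
    logreg = LogisticRegression()
    print(cross_val_score(logreg, X, y, cv=5).mean())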















python scikit-learn decision-tree






asked Nov 26 '15 at 11:30









Gábor Erdős

1 Answer













I am not offering a way to directly stop the classifier from using a feature multiple times. (You could do it by defining your own splitter and wiring it in, but that is a lot of work.)



I would suggest first making sure that you are balancing your classes; take a look at the class_weight parameter for details. That should help a lot with your issue. If that does not work, you can still enforce that no leaf holds too little weight by using min_weight_fraction_leaf or the similar parameters suggested by maxymoo.
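For concreteness, a minimal sketch of those two knobs (the imbalanced toy dataset and the 5% threshold are placeholders, not values from the question):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # imbalanced toy stand-in for the dataset described in the question
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.9, 0.1], random_state=0)

    clf = DecisionTreeClassifier(
        class_weight='balanced',        # re-weight samples so the rare class is not swamped
        min_weight_fraction_leaf=0.05,  # no leaf may hold less than 5% of the total weight
        random_state=0,
    )
    clf.fit(X, y)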


























        answered Nov 11 at 1:22









        zsomko
