How to: Exclude a row from stddev/mean calculations and join later










0















So I'm trying to find machines that throw exceptionally many exceptions as compared to their group by environment and function. The intuition is that the load and task type should be fairly similar across the group, so if one machine throws a lot more exceptions, it's probably somehow in a bad state and should be serviced.



This works fairly well for large groups of machines, but for smaller groups there's a problem: If there's very few machines, and only one of them is throwing a lot of exceptions, it may not get detected. The reason is that because that datapoint is pars of the general stddev and mean calculation for the group, the mean and stddev are biased towards this outlier.



The solution would be to either somehow subtract that data point from the calculated stddev and mean for the entire group, or to calculate the stddev and mean for each machine/environment/function combination (excluding the machine in question from the stddev/mean calculation) instead of just by environment/function group.



Here's the current code that does it by environment/function. Is there an elegant solution to expand it to do the machine/environment/function?



// Find sick machines
let SickMachinesAt = (AtTime:datetime , TimeWindow:timespan = 1h, Sigmas:double = 3.0, MinimumExceptionsToTrigger:int = 10) summarize count() by environmentName, machineFunction, machineName










share|improve this question


























    0















    So I'm trying to find machines that throw exceptionally many exceptions as compared to their group by environment and function. The intuition is that the load and task type should be fairly similar across the group, so if one machine throws a lot more exceptions, it's probably somehow in a bad state and should be serviced.



    This works fairly well for large groups of machines, but for smaller groups there's a problem: If there's very few machines, and only one of them is throwing a lot of exceptions, it may not get detected. The reason is that because that datapoint is pars of the general stddev and mean calculation for the group, the mean and stddev are biased towards this outlier.



    The solution would be to either somehow subtract that data point from the calculated stddev and mean for the entire group, or to calculate the stddev and mean for each machine/environment/function combination (excluding the machine in question from the stddev/mean calculation) instead of just by environment/function group.



    Here's the current code that does it by environment/function. Is there an elegant solution to expand it to do the machine/environment/function?



    // Find sick machines
    let SickMachinesAt = (AtTime:datetime , TimeWindow:timespan = 1h, Sigmas:double = 3.0, MinimumExceptionsToTrigger:int = 10) summarize count() by environmentName, machineFunction, machineName










    share|improve this question
























      0












      0








      0








      So I'm trying to find machines that throw exceptionally many exceptions as compared to their group by environment and function. The intuition is that the load and task type should be fairly similar across the group, so if one machine throws a lot more exceptions, it's probably somehow in a bad state and should be serviced.



      This works fairly well for large groups of machines, but for smaller groups there's a problem: If there's very few machines, and only one of them is throwing a lot of exceptions, it may not get detected. The reason is that because that datapoint is pars of the general stddev and mean calculation for the group, the mean and stddev are biased towards this outlier.



      The solution would be to either somehow subtract that data point from the calculated stddev and mean for the entire group, or to calculate the stddev and mean for each machine/environment/function combination (excluding the machine in question from the stddev/mean calculation) instead of just by environment/function group.



      Here's the current code that does it by environment/function. Is there an elegant solution to expand it to do the machine/environment/function?



      // Find sick machines
      let SickMachinesAt = (AtTime:datetime , TimeWindow:timespan = 1h, Sigmas:double = 3.0, MinimumExceptionsToTrigger:int = 10) summarize count() by environmentName, machineFunction, machineName










      share|improve this question














      So I'm trying to find machines that throw exceptionally many exceptions as compared to their group by environment and function. The intuition is that the load and task type should be fairly similar across the group, so if one machine throws a lot more exceptions, it's probably somehow in a bad state and should be serviced.



      This works fairly well for large groups of machines, but for smaller groups there's a problem: If there's very few machines, and only one of them is throwing a lot of exceptions, it may not get detected. The reason is that because that datapoint is pars of the general stddev and mean calculation for the group, the mean and stddev are biased towards this outlier.



      The solution would be to either somehow subtract that data point from the calculated stddev and mean for the entire group, or to calculate the stddev and mean for each machine/environment/function combination (excluding the machine in question from the stddev/mean calculation) instead of just by environment/function group.



      Here's the current code that does it by environment/function. Is there an elegant solution to expand it to do the machine/environment/function?



      // Find sick machines
      let SickMachinesAt = (AtTime:datetime , TimeWindow:timespan = 1h, Sigmas:double = 3.0, MinimumExceptionsToTrigger:int = 10) summarize count() by environmentName, machineFunction, machineName







      kusto azure-data-explorer






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 14 '18 at 19:20









      Anders Sewerin JohansenAnders Sewerin Johansen

      607826




      607826






















          1 Answer
          1






          active

          oldest

          votes


















          1














          One option to avoid missed detects due to strong outlier values is to use percentiles based detection. For this purpose you could use make-series followed by the built-in series_outliers function.



          Another option is to remove outliers from the calculation and then join back on the data and requires multiple joins. Assuming your exceptions are in Exceptions which contain the dimesnions: environmentName, machineFunction, machineName, you could remove all machines with counts above the 98th percentile with the following pseudo-query:



          let ExceptionsCounts = Exceptions 
          | summarize counts = count() by environmentName, machineFunction, machineName;
          let ExceptionsCleansed = ExceptionsCounts
          | summarize p98 = percentile(counts, 98) by environmentName, machineFunction
          | join kind=inner (ExceptionsCounts) on environmentName, machineFunction
          | where counts < p98;


          From there you could use ExceptionsCleansed to calculate the mean/stddev and continue with the detection on the original Exceptions with the calculated figures with exactly the same query you posted.






          share|improve this answer






















            Your Answer






            StackExchange.ifUsing("editor", function ()
            StackExchange.using("externalEditor", function ()
            StackExchange.using("snippets", function ()
            StackExchange.snippets.init();
            );
            );
            , "code-snippets");

            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "1"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













            draft saved

            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53307372%2fhow-to-exclude-a-row-from-stddev-mean-calculations-and-join-later%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            One option to avoid missed detects due to strong outlier values is to use percentiles based detection. For this purpose you could use make-series followed by the built-in series_outliers function.



            Another option is to remove outliers from the calculation and then join back on the data and requires multiple joins. Assuming your exceptions are in Exceptions which contain the dimesnions: environmentName, machineFunction, machineName, you could remove all machines with counts above the 98th percentile with the following pseudo-query:



            let ExceptionsCounts = Exceptions 
            | summarize counts = count() by environmentName, machineFunction, machineName;
            let ExceptionsCleansed = ExceptionsCounts
            | summarize p98 = percentile(counts, 98) by environmentName, machineFunction
            | join kind=inner (ExceptionsCounts) on environmentName, machineFunction
            | where counts < p98;


            From there you could use ExceptionsCleansed to calculate the mean/stddev and continue with the detection on the original Exceptions with the calculated figures with exactly the same query you posted.






            share|improve this answer



























              1














              One option to avoid missed detects due to strong outlier values is to use percentiles based detection. For this purpose you could use make-series followed by the built-in series_outliers function.



              Another option is to remove outliers from the calculation and then join back on the data and requires multiple joins. Assuming your exceptions are in Exceptions which contain the dimesnions: environmentName, machineFunction, machineName, you could remove all machines with counts above the 98th percentile with the following pseudo-query:



              let ExceptionsCounts = Exceptions 
              | summarize counts = count() by environmentName, machineFunction, machineName;
              let ExceptionsCleansed = ExceptionsCounts
              | summarize p98 = percentile(counts, 98) by environmentName, machineFunction
              | join kind=inner (ExceptionsCounts) on environmentName, machineFunction
              | where counts < p98;


              From there you could use ExceptionsCleansed to calculate the mean/stddev and continue with the detection on the original Exceptions with the calculated figures with exactly the same query you posted.






              share|improve this answer

























                1












                1








                1







                One option to avoid missed detects due to strong outlier values is to use percentiles based detection. For this purpose you could use make-series followed by the built-in series_outliers function.



                Another option is to remove outliers from the calculation and then join back on the data and requires multiple joins. Assuming your exceptions are in Exceptions which contain the dimesnions: environmentName, machineFunction, machineName, you could remove all machines with counts above the 98th percentile with the following pseudo-query:



                let ExceptionsCounts = Exceptions 
                | summarize counts = count() by environmentName, machineFunction, machineName;
                let ExceptionsCleansed = ExceptionsCounts
                | summarize p98 = percentile(counts, 98) by environmentName, machineFunction
                | join kind=inner (ExceptionsCounts) on environmentName, machineFunction
                | where counts < p98;


                From there you could use ExceptionsCleansed to calculate the mean/stddev and continue with the detection on the original Exceptions with the calculated figures with exactly the same query you posted.






                share|improve this answer













                One option to avoid missed detects due to strong outlier values is to use percentiles based detection. For this purpose you could use make-series followed by the built-in series_outliers function.



                Another option is to remove outliers from the calculation and then join back on the data and requires multiple joins. Assuming your exceptions are in Exceptions which contain the dimesnions: environmentName, machineFunction, machineName, you could remove all machines with counts above the 98th percentile with the following pseudo-query:



                let ExceptionsCounts = Exceptions 
                | summarize counts = count() by environmentName, machineFunction, machineName;
                let ExceptionsCleansed = ExceptionsCounts
                | summarize p98 = percentile(counts, 98) by environmentName, machineFunction
                | join kind=inner (ExceptionsCounts) on environmentName, machineFunction
                | where counts < p98;


                From there you could use ExceptionsCleansed to calculate the mean/stddev and continue with the detection on the original Exceptions with the calculated figures with exactly the same query you posted.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 19 '18 at 17:59









                RoyORoyO

                261




                261





























                    draft saved

                    draft discarded
















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid


                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.

                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53307372%2fhow-to-exclude-a-row-from-stddev-mean-calculations-and-join-later%23new-answer', 'question_page');

                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Top Tejano songwriter Luis Silva dead of heart attack at 64

                    ReactJS Fetched API data displays live - need Data displayed static

                    馬尼拉人質事件