Check format similarity between two strings










I have a string format with these rules:

  • the string must be 15 characters long

  • the first 8 characters are a date

Example: '2009060712ab56c'

I want to compare such a string with another string and get a percentage of format similarity, for example:

    result = format_similarity('2009060712ab56c', '20070908njndla56gjhk')

Here the result would be, say, 80%.

Is there a way of doing this?







python string format fuzzy-comparison






edited Nov 15 '18 at 12:40 by jonrsharpe

asked Nov 15 '18 at 12:30 by s900n







  • What do you mean by "format similarity"? Is Levenshtein distance enough?
    – JETM, Nov 15 '18 at 12:39

  • Have you tried this? stackoverflow.com/a/17388505/8835357
    – specbug, Nov 15 '18 at 12:41

  • Even easier: since, if I understand correctly, both strings are 15 characters long, simply iterate over the characters of both strings and count how many of them are equal.
    – quant, Nov 15 '18 at 12:57

  • They aren't both 15 characters long.
    – Neil, Nov 15 '18 at 13:01












2 Answers

Your format consists of two different attributes, which would be measured differently. How you combine them into an overall percentage of format similarity is a business-logic question. For example, if a digit is missing at the start, is the string now totally different because it no longer begins with a date, or is it still similar? Here is how you can get the measurements:

    import re


    def determine_similarity(string, other):
        length_string = len(string)  # use len to get the number of characters in the string
        length_other = len(other)
        number_of_numbers_string = _determine_number_of_numbers(string)
        number_of_numbers_other = _determine_number_of_numbers(other)

        # Some logic is still needed here to create a metric of similarity
        # (find the differences and divide them?). As a placeholder, return
        # the raw differences in length and in leading-digit count.
        return (
            abs(length_string - length_other),
            abs(number_of_numbers_string - number_of_numbers_other),
        )


    LEADING_NUMBERS = re.compile(
        r"^"      # anchor at start of string
        r"[0-9]"  # must be a digit
        r"+"      # one or more matches
    )


    def _determine_number_of_numbers(string):
        """
        Determine how many LEADING digits are in a string
        """
        match = LEADING_NUMBERS.search(string)
        if match is not None:
            length = len(match.group())  # number of digits is the length of the match
        else:
            length = 0  # no match means no digits

        # You might want to check whether the digits constitute a date within a
        # certain range or something like that. For example, take the first four
        # digits and check whether the year is between 1980 and 2018.
        return length
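The combining step is left open above, so here is one way it might be filled in. This is a minimal sketch under stated assumptions, not the answer's own method: it assumes the date is laid out as YYYYMMDD (as the example '20090607' suggests), weights length and date equally, and gives the date all-or-nothing credit. The names `DATE_PREFIX` and `has_date_prefix` and the two weight parameters are hypothetical; only `format_similarity` comes from the question.

```python
import re
from datetime import datetime

DATE_PREFIX = re.compile(r"[0-9]{8}")  # eight leading ASCII digits


def has_date_prefix(text):
    """Return True if the first 8 characters form a valid YYYYMMDD date (assumed layout)."""
    match = DATE_PREFIX.match(text)
    if match is None:
        return False
    try:
        datetime.strptime(match.group(), "%Y%m%d")
    except ValueError:  # e.g. month 13 or day 32
        return False
    return True


def format_similarity(string, other, length_weight=0.5, date_weight=0.5):
    """Combine a length measurement and a date measurement into a percentage."""
    longest = max(len(string), len(other))
    if longest == 0:
        length_score = 1.0  # two empty strings have identical length
    else:
        length_score = 1.0 - abs(len(string) - len(other)) / longest
    # 1.0 when both strings agree on having (or lacking) a date prefix.
    date_score = 1.0 if has_date_prefix(string) == has_date_prefix(other) else 0.0
    return 100.0 * (length_weight * length_score + date_weight * date_score)


print(format_similarity('2009060712ab56c', '20070908njndla56gjhk'))  # 87.5
```

With these assumed weights the question's example scores 87.5 rather than the hoped-for 80, which illustrates the point that the final number depends entirely on the business rules chosen.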






answered Nov 15 '18 at 12:55 by Neil























As JETM pointed out in the comments, https://pypi.org/project/python-Levenshtein/ might be a good resource for comparing the "closeness", i.e. the edit distance, of two strings (how many changes have to be made to one string for it to match the other).

You could create your own implementation of "edit distance" that reflects your custom rules, such as:

• the first 8 characters are numeric and form a valid date

• the total string length is 15 characters
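A custom edit distance along these lines could be sketched as follows. This is an illustrative sketch only, assuming that "format" means the sequence of character classes (digit, letter, other) rather than the characters themselves. It uses a plain-Python Levenshtein distance instead of the python-Levenshtein package, and the names `signature`, `edit_distance` and `signature_similarity` are hypothetical.

```python
def signature(text):
    """Map each character to its class: 'd' digit, 'a' letter, '?' other."""
    classes = []
    for char in text:
        if char.isdigit():
            classes.append("d")
        elif char.isalpha():
            classes.append("a")
        else:
            classes.append("?")
    return "".join(classes)


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def signature_similarity(string, other):
    """Percentage similarity of the two character-class signatures."""
    longest = max(len(string), len(other))
    if longest == 0:
        return 100.0  # two empty strings share the same (empty) format
    distance = edit_distance(signature(string), signature(other))
    return 100.0 * (1.0 - distance / longest)
```

For the two strings in the question, the signatures differ by 7 edits over a maximum length of 20, so this measure comes out at 65%. It does not check that the leading digits form a valid date; that rule would need a separate check.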






answered Nov 15 '18 at 12:57 by jrsh, edited Nov 15 '18 at 13:05


























