Check format similarity between two strings
I have a string format with these rules:
- the string must be 15 characters long
- the first 8 characters are a date
Example: '2009060712ab56c'
I want to compare a string in this format with another string and get a percentage of format similarity, for example:
result = format_similarity('2009060712ab56c', '20070908njndla56gjhk')
where result would be, say, 80% in this case.
Is there a way of doing this?
python string format fuzzy-comparison
edited Nov 15 '18 at 12:40 by jonrsharpe
asked Nov 15 '18 at 12:30 by s900n
What do you mean by "format similarity"? Is Levenshtein distance enough? – JETM, Nov 15 '18 at 12:39
Have you tried this? stackoverflow.com/a/17388505/8835357 – specbug, Nov 15 '18 at 12:41
Even easier: since, if I understand correctly, both strings are 15 characters long, simply iterate over the characters of both strings and count how many of them are equal. – quant, Nov 15 '18 at 12:57
They aren't both 15 characters long. – Neil, Nov 15 '18 at 13:01
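The position-by-position idea in the comments can be adapted to strings of different lengths. What follows is only a minimal sketch, assuming that "format" means the character class at each position (digit, letter, or other) and that positions missing from the shorter string count as mismatches. The names `char_class` and `positional_format_similarity` are hypothetical and do not come from the thread.

```python
from itertools import zip_longest


def char_class(ch):
    """Map a character to a coarse class: digit, letter, or other."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "a"
    return "o"


def positional_format_similarity(first, second):
    """Fraction of positions whose character classes agree."""
    longest = max(len(first), len(second))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    matches = sum(
        1
        for a, b in zip_longest(first, second)  # pads the shorter string with None
        if a is not None and b is not None and char_class(a) == char_class(b)
    )
    return matches / longest


print(positional_format_similarity('2009060712ab56c', '20070908njndla56gjhk'))
```

For the two strings in the question, 10 of the 20 positions agree in character class, so this sketch reports 0.5. Whether that matches the intended notion of similarity depends on how the format is defined.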
2 Answers
Your format consists of two different attributes, which would be measured differently. How you combine those into an overall percentage of format similarity is a business-logic question. For example, if a digit is missing at the start, is the string now totally different because it no longer starts with a date? Or is it still similar? Either way, here is how you can get the measurements:

    import re


    def determine_similarity(string, other):
        length_string = len(string)  # len gives the number of characters in the string
        length_other = len(other)
        number_of_numbers_string = _determine_number_of_numbers(string)
        number_of_numbers_other = _determine_number_of_numbers(other)
        # TODO: some logic here to create a metric of similarity
        # (find the differences and divide them?).
        # Placeholder: return the raw measurements for each string.
        return (
            (length_string, number_of_numbers_string),
            (length_other, number_of_numbers_other),
        )


    LEADING_NUMBERS = re.compile(
        r"^"      # anchor at start of string
        r"[0-9]"  # must be a digit
        r"+"      # one or more matches
    )


    def _determine_number_of_numbers(string):
        """
        Determine how many LEADING digits are in a string
        """
        match = LEADING_NUMBERS.search(string)
        if match is not None:
            length = len(match.group())  # number of digits is the length of the match
        else:
            length = 0  # no match means no leading digits
        # You might want to check whether the digits constitute a date within a
        # certain range or something like that. For example, take the first four
        # digits and check whether the year is between 1980 and 2018.
        return length
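
One possible way to fill in the combining step, purely as an illustration: score each measurement as the ratio of the smaller value to the larger one, then average the two scores. The equal weighting and the names `leading_digit_count`, `ratio`, and `combined_similarity` are assumptions, not part of the answer above. The right weighting depends on your business logic.

```python
import re

LEADING_DIGITS = re.compile(r"^[0-9]+")


def leading_digit_count(string):
    """Return how many digits the string starts with."""
    match = LEADING_DIGITS.search(string)
    return len(match.group()) if match else 0


def ratio(a, b):
    """Smaller value divided by the larger one; 1.0 when both are equal (including zero)."""
    if a == b:
        return 1.0
    return min(a, b) / max(a, b)


def combined_similarity(string, other):
    """Average of length closeness and leading-digit closeness (equal weights assumed)."""
    length_score = ratio(len(string), len(other))
    digit_score = ratio(leading_digit_count(string), leading_digit_count(other))
    return (length_score + digit_score) / 2
```

With the two strings from the question, the lengths are 15 and 20 and the leading digit runs are 10 and 8, so this sketch yields roughly 0.775.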
answered Nov 15 '18 at 12:55 by Neil
As JETM pointed out in the comments, https://pypi.org/project/python-Levenshtein/ might be a good resource for comparing the "closeness", i.e. the edit distance, of two strings (how many changes have to be made to one string to make it match the other).
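
For reference, the edit distance itself is short to compute without the third-party package. This is a minimal sketch in plain Python. `levenshtein` and `edit_similarity` are illustrative names, and dividing by the longer length is just one common way to turn the distance into a percentage. Note that it compares raw characters, so two strings in the same format but with different dates still score below 100%.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Scale the edit distance into the range 0.0 to 1.0 (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest
```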
You could create your own implementation of "edit distance" that accounts for your custom rules, such as:
- the first 8 characters are numeric and form a valid date
- the total string length is 15 characters
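
The two rules above could be checked directly, as in the sketch below. It assumes the date is laid out as YYYYMMDD (the question does not say so, although the example '20090607' is consistent with it) and weights both rules equally. The names `starts_with_valid_date` and `format_score` are hypothetical.

```python
from datetime import datetime

EXPECTED_LENGTH = 15  # rule: total string of 15 characters
DATE_LENGTH = 8       # rule: first 8 characters are a date


def starts_with_valid_date(string):
    """True if the first 8 characters are ASCII digits forming a real calendar date."""
    prefix = string[:DATE_LENGTH]
    if len(prefix) != DATE_LENGTH or not (prefix.isascii() and prefix.isdigit()):
        return False
    try:
        datetime.strptime(prefix, "%Y%m%d")  # assumed layout: YYYYMMDD
    except ValueError:
        return False
    return True


def format_score(string):
    """Fraction of the format rules that the string satisfies (equal weights assumed)."""
    rules = [
        len(string) == EXPECTED_LENGTH,
        starts_with_valid_date(string),
    ]
    return sum(rules) / len(rules)
```

Under these assumptions '2009060712ab56c' satisfies both rules, while '20070908njndla56gjhk' satisfies only the date rule. Partial credit for a near-miss on length, or a plausible-year check, could be added as further rules.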
edited Nov 15 '18 at 13:05
answered Nov 15 '18 at 12:57 by jrsh