How to parse only duplicates from a list in Python?

If I have a list of gene names, for example, and I want to create a new list containing only the genes that repeat, how would I do this?

Example of original list:

RGN
RBM10
ARAF
ZNF630
FTSJ1
SLC35A2
SLC35A2
SLC35A2
MAGIX
DGKK
XAGE1B
XAGE1B
SMC1A
FAM120C
CXorf49
CXorf49B
CHIC1
ABCB7
PBDC1
FGF16
ATP7A
CYLC1
TSPAN6
BTK
BTK
TCEAL4
TEX13A
FRMPD3
PRPS1
COL4A6
COL4A6
COL4A6

For example, SLC35A2 would be in the new list because it appears 3 times.

Please suggest.

edited Nov 10 at 16:19

ShadowRanger

55.6k44789

asked Nov 10 at 15:49

Michael C.

214

Are you looking for a language-independent algorithm or for an implementation in a specific language?
– Christophe Strobbe
Nov 10 at 15:51

I'm sorry I did not clarify, my apologies. I'm working in Python
– Michael C.
Nov 10 at 16:02

python python-3.x duplicates bioinformatics


3 Answers

collections.Counter makes this fast and trivial:

from collections import Counter

# Using other answer's listOfGenes for convenience
listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split()

# Actual work is a one-liner; count them all, keep those with count of 2 or more
duplicates = [gene for gene, cnt in Counter(listOfGenes).items() if cnt >= 2]

On CPython 3.6 and higher (and on any Python interpreter from 3.7 onward), dicts preserve insertion order, so the duplicates list will be in order of first appearance in listOfGenes; on 3.5 and earlier, its ordering is arbitrary.
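If the counts themselves are also of interest, the same Counter idea can return them alongside the names. This is only a sketch: the helper name duplicate_counts and the shortened sample list are made up for illustration and are not part of the answer above.

```python
from collections import Counter


def duplicate_counts(genes):
    """Return {gene: count} for every gene that occurs at least twice.

    Hypothetical helper for illustration; relies on dicts keeping
    insertion order (Python 3.7+) for first-appearance ordering.
    """
    counts = Counter(genes)
    return {gene: n for gene, n in counts.items() if n >= 2}


# Shortened sample list, assumed for the example
sample_genes = "RGN SLC35A2 SLC35A2 SLC35A2 XAGE1B XAGE1B CHIC1 BTK BTK".split()
result = duplicate_counts(sample_genes)
print(result)
```

Calling list(result) gives just the gene names if the counts are not needed.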

answered Nov 10 at 16:19

ShadowRanger

55.6k44789

Nice answer, didn't know of the collections framework ;)
– quant
Nov 10 at 16:30


You can do that like so:

listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split(" ")

genesOccurences = {}  # maps each gene name to the number of times it appears
for gene in listOfGenes:
    occurence = genesOccurences.get(gene, 0)
    genesOccurences[gene] = occurence + 1

print(genesOccurences) # will print a dictionary with every gene and how often it is occurring

filteredGeneList = [ key for key in genesOccurences if genesOccurences[key] > 1 ]
print(filteredGeneList) # will print only those genes occurring > 1 times.
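The manual counting loop above could also be written with collections.defaultdict, which supplies the starting value of 0 automatically. A minimal sketch, assuming a hypothetical helper name count_genes and a shortened sample list:

```python
from collections import defaultdict


def count_genes(genes):
    """Count how often each gene occurs (hypothetical helper)."""
    counts = defaultdict(int)  # missing keys start at 0
    for gene in genes:
        counts[gene] += 1
    return dict(counts)  # plain dict, so later lookups do not add keys


# Shortened sample list, assumed for the example
sample_genes = "RGN SLC35A2 SLC35A2 MAGIX BTK BTK COL4A6 COL4A6 COL4A6".split()
gene_counts = count_genes(sample_genes)
repeated = [gene for gene, n in gene_counts.items() if n > 1]
print(repeated)
```

The filtering step is the same list comprehension as in the answer; only the counting changes.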

answered Nov 10 at 16:12

quant

1,42111226


Time Complexity = O(n)

Space Complexity = O(n)

Code:

def get_duplicates(array):
    seen = set()
    results = set()
    for element in array:
        if element in seen:
            results.add(element)
        else:
            seen.add(element)

    return list(results)

input_array = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6"
input_array = input_array.split()
duplicates = get_duplicates(input_array)
print(duplicates)

Output (the order may vary between runs, since sets are unordered):
['COL4A6', 'SLC35A2', 'XAGE1B', 'BTK']
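If a stable order matters, one possible variation on the seen-set approach keeps the duplicates in a list and uses a second set only to avoid reporting a gene twice. This is a sketch under that assumption; the name get_duplicates_in_order and the shortened sample list are hypothetical.

```python
def get_duplicates_in_order(items):
    """Return each repeated item once, in the order its first repeat is seen.

    Still O(n) time and O(n) space, like the set-based version.
    """
    seen = set()       # every item encountered so far
    reported = set()   # duplicates already added to the result
    duplicates = []
    for item in items:
        if item in seen and item not in reported:
            duplicates.append(item)
            reported.add(item)
        seen.add(item)
    return duplicates


# Shortened sample list, assumed for the example
sample_genes = "RGN SLC35A2 SLC35A2 SLC35A2 XAGE1B XAGE1B BTK BTK COL4A6 COL4A6".split()
ordered_duplicates = get_duplicates_in_order(sample_genes)
print(ordered_duplicates)
```

Unlike list(results) in the version above, the returned list does not depend on set iteration order.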

edited Nov 12 at 1:48

answered Nov 10 at 20:43

Jai

1,4322616
