How to parse only duplicates from a list in Python?
If I have a list of gene names, for example, and I want to create a new list containing only the genes that repeat, how would I do this?
Example of the original list:
RGN
RBM10
ARAF
ZNF630
FTSJ1
SLC35A2
SLC35A2
SLC35A2
MAGIX
DGKK
XAGE1B
XAGE1B
SMC1A
FAM120C
CXorf49
CXorf49B
CHIC1
ABCB7
PBDC1
FGF16
ATP7A
CYLC1
TSPAN6
BTK
BTK
TCEAL4
TEX13A
FRMPD3
PRPS1
COL4A6
COL4A6
COL4A6
For example, SLC35A2 would be in the new list because it appears 3 times.
Any suggestions would be appreciated.
python python-3.x duplicates bioinformatics
edited Nov 10 at 16:19 by ShadowRanger
asked Nov 10 at 15:49 by Michael C.
Are you looking for a language-independent algorithm or for an implementation in a specific language? – Christophe Strobbe, Nov 10 at 15:51
I'm sorry I did not clarify, my apologies. I'm working in Python. – Michael C., Nov 10 at 16:02
3 Answers
collections.Counter makes this fast and trivial:

from collections import Counter

# Using the other answer's listOfGenes for convenience
listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split()

# The actual work is a one-liner: count every gene, keep those with a count of 2 or more
duplicates = [gene for gene, cnt in Counter(listOfGenes).items() if cnt >= 2]

On CPython 3.6 and higher (and on every Python interpreter from 3.7 onward), dicts preserve insertion order, so the duplicates list will be ordered by first appearance in listOfGenes; on 3.5 and earlier, its ordering is arbitrary.
answered Nov 10 at 16:19 by ShadowRanger
Nice answer, I didn't know of the collections framework ;) – quant, Nov 10 at 16:30
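
The question mentions that SLC35A2 repeats 3 times, so it may be useful to keep the counts next to the gene names. The following is a minimal sketch of that idea rather than part of the answer above; the shortened `genes` list and the name `repeat_counts` are illustrative assumptions, and the ordering shown relies on Python 3.7+ dicts.

```python
from collections import Counter

# Shortened, illustrative version of the gene list from the question
genes = "RGN SLC35A2 SLC35A2 SLC35A2 XAGE1B XAGE1B BTK BTK COL4A6 COL4A6 COL4A6".split()

# Count every gene once, then keep only the entries seen at least twice
counts = Counter(genes)
repeat_counts = {gene: n for gene, n in counts.items() if n >= 2}

print(repeat_counts)
# {'SLC35A2': 3, 'XAGE1B': 2, 'BTK': 2, 'COL4A6': 3}
```

`list(repeat_counts)` then gives the same list of names as the one-liner, in order of first appearance.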
You can do that like so:

listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split(" ")

# Count how often each gene occurs
genesOccurences = {}
for gene in listOfGenes:
    occurence = genesOccurences.get(gene, 0)
    genesOccurences[gene] = occurence + 1
print(genesOccurences)  # prints a dictionary mapping every gene to its number of occurrences

# Keep only the genes that occur more than once
filteredGeneList = [key for key in genesOccurences if genesOccurences[key] > 1]
print(filteredGeneList)  # prints only the genes occurring more than once
answered Nov 10 at 16:12 by quant
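
The question could also be read as wanting every occurrence of a repeated gene kept, not just one entry per gene. That reading is an assumption, since the asker did not confirm it, but it needs only a small change to the counting approach: filter the original list instead of the dictionary keys. The shortened `genes` list and the name `all_repeats` are illustrative.

```python
# Shortened, illustrative version of the gene list from the question
genes = "RGN SLC35A2 SLC35A2 SLC35A2 MAGIX XAGE1B XAGE1B SMC1A BTK BTK".split()

# First pass: count how often each gene occurs
occurrences = {}
for gene in genes:
    occurrences[gene] = occurrences.get(gene, 0) + 1

# Second pass: walk the original list so every copy of a repeated gene is kept
all_repeats = [gene for gene in genes if occurrences[gene] > 1]

print(all_repeats)
# ['SLC35A2', 'SLC35A2', 'SLC35A2', 'XAGE1B', 'XAGE1B', 'BTK', 'BTK']
```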
- Time complexity: O(n)
- Space complexity: O(n)

Code:

def get_duplicates(array):
    seen = set()     # every element encountered so far
    results = set()  # elements encountered more than once
    for element in array:
        if element in seen:
            results.add(element)
        else:
            seen.add(element)
    return list(results)

input_array = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6"
input_array = input_array.split()
duplicates = get_duplicates(input_array)
print(duplicates)

Output (the order can vary between runs because sets are unordered):

['COL4A6', 'SLC35A2', 'XAGE1B', 'BTK']
edited Nov 12 at 1:48
answered Nov 10 at 20:43 by Jai
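
Because the set-based approach returns its result from a set, the order of the output is not tied to the order of the input. If a stable order matters, one possible variant (a sketch, not the answerer's code) appends each gene to a list the first time it is seen again. The function name `get_duplicates_ordered` and the shortened `genes` list are illustrative assumptions.

```python
def get_duplicates_ordered(items):
    """Return each repeated item once, in the order its first repeat is found."""
    seen = set()       # every item encountered so far
    reported = set()   # items already added to the result
    duplicates = []
    for item in items:
        if item in seen:
            if item not in reported:
                reported.add(item)
                duplicates.append(item)
        else:
            seen.add(item)
    return duplicates


# Shortened, illustrative version of the gene list from the question
genes = "RGN SLC35A2 SLC35A2 SLC35A2 XAGE1B XAGE1B BTK BTK COL4A6 COL4A6 COL4A6".split()
print(get_duplicates_ordered(genes))
# ['SLC35A2', 'XAGE1B', 'BTK', 'COL4A6']
```

This keeps the O(n) time and O(n) space of the set-based version and works on any Python 3 release, since it does not depend on dict ordering.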