How to parse only duplicates from a list in Python?









If I have a list of gene names, for example, and I want to create a new list containing only the genes that repeat, how would I do this?



Example of original list:



RGN
RBM10
ARAF
ZNF630
FTSJ1
SLC35A2
SLC35A2
SLC35A2
MAGIX
DGKK
XAGE1B
XAGE1B
SMC1A
FAM120C
CXorf49
CXorf49B
CHIC1
ABCB7
PBDC1
FGF16
ATP7A
CYLC1
TSPAN6
BTK
BTK
TCEAL4
TEX13A
FRMPD3
PRPS1
COL4A6
COL4A6
COL4A6


For example, SLC35A2 would be in the new list because it appears 3 times.



Please suggest.







python python-3.x duplicates bioinformatics














edited Nov 10 at 16:19 by ShadowRanger
asked Nov 10 at 15:49 by Michael C.











  • Are you looking for a language-independent algorithm or for an implementation in a specific language?
    – Christophe Strobbe
    Nov 10 at 15:51










  • I'm sorry I did not clarify, my apologies. I'm working in Python
    – Michael C.
    Nov 10 at 16:02




























3 Answers
collections.Counter makes this fast and trivial:

from collections import Counter

# Using the other answer's listOfGenes for convenience
listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split()

# The actual work is a one-liner: count every gene, keep those with a count of 2 or more
duplicates = [gene for gene, cnt in Counter(listOfGenes).items() if cnt >= 2]

On CPython 3.6 and higher (and on every Python implementation from 3.7 on), dicts preserve insertion order, so the duplicates list will be ordered by first appearance in listOfGenes; on 3.5 and earlier, it will have arbitrary ordering.
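If the result order matters and the interpreter version is not known, one possible workaround is to walk the original list instead of the Counter. The following is only a sketch; the function name duplicates_in_order is made up for illustration and is not part of the answer above.

```python
from collections import Counter

def duplicates_in_order(items):
    """Return each item that occurs at least twice, in order of first appearance.

    Hypothetical helper: the order comes from iterating the input list,
    so it does not depend on dict ordering guarantees.
    """
    counts = Counter(items)
    already_added = set()
    result = []
    for item in items:
        if counts[item] >= 2 and item not in already_added:
            already_added.add(item)
            result.append(item)
    return result

listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split()
ordered_duplicates = duplicates_in_order(listOfGenes)
print(ordered_duplicates)  # ['SLC35A2', 'XAGE1B', 'BTK', 'COL4A6']
```

This costs one extra pass over the list and one extra set, but stays O(n) overall.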

















answered Nov 10 at 16:19 by ShadowRanger











  • Nice answer, I didn't know about the collections framework ;)
    – quant
    Nov 10 at 16:30
















You can do that like so:

listOfGenes = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6".split(" ")

# Count how often each gene occurs
genesOccurences = {}
for gene in listOfGenes:
    occurence = genesOccurences.get(gene, 0)
    genesOccurences[gene] = occurence + 1

print(genesOccurences)  # prints a dictionary mapping every gene to its number of occurrences

filteredGeneList = [key for key in genesOccurences if genesOccurences[key] > 1]
print(filteredGeneList)  # prints only the genes that occur more than once
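The same counting loop can also be written with collections.defaultdict, which removes the need for .get(gene, 0). This is a minimal sketch under assumptions: the helper names count_occurrences and keep_repeated and the min_count parameter are invented for illustration, and the gene list is a shortened sample of the one in the question.

```python
from collections import defaultdict

def count_occurrences(items):
    """Return a plain dict mapping each item to how many times it appears."""
    counts = defaultdict(int)  # missing keys start at 0
    for item in items:
        counts[item] += 1
    return dict(counts)

def keep_repeated(counts, min_count=2):
    """Keep only the entries whose count is at least min_count (hypothetical threshold)."""
    return {item: n for item, n in counts.items() if n >= min_count}

# Shortened sample of the gene list from the question
sample_genes = "RGN SLC35A2 SLC35A2 SLC35A2 MAGIX XAGE1B XAGE1B BTK BTK COL4A6 COL4A6 COL4A6".split()
gene_counts = count_occurrences(sample_genes)

print(keep_repeated(gene_counts))               # genes seen at least twice, with their counts
print(keep_repeated(gene_counts, min_count=3))  # genes seen at least three times
```

Keeping the counts alongside the names can be handy when the next question is "how often does each duplicate occur?".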
















answered Nov 10 at 16:12 by quant




















• Time Complexity = O(n)

• Space Complexity = O(n)

• Code:

  def get_duplicates(array):
      seen = set()      # every element encountered so far
      results = set()   # elements encountered more than once
      for element in array:
          if element in seen:
              results.add(element)
          else:
              seen.add(element)

      return list(results)

  input_array = "RGN RBM10 ARAF ZNF630 FTSJ1 SLC35A2 SLC35A2 SLC35A2 MAGIX DGKK XAGE1B XAGE1B SMC1A FAM120C CXorf49 CXorf49B CHIC1 ABCB7 PBDC1 FGF16 ATP7A CYLC1 TSPAN6 BTK BTK TCEAL4 TEX13A FRMPD3 PRPS1 COL4A6 COL4A6 COL4A6"
  input_array = input_array.split()
  duplicates = get_duplicates(input_array)
  print(duplicates)

• Output (the order may vary between runs, because sets are unordered):
  ['COL4A6', 'SLC35A2', 'XAGE1B', 'BTK']
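Because the seen-set approach needs only a single pass, it can also be applied to input that is read lazily, such as a file with one gene name per line as in the question. The sketch below is an assumption-laden variant, not the answer's code: iter_duplicates is a hypothetical name, and io.StringIO stands in for a real file with a shortened gene list.

```python
import io

def iter_duplicates(items):
    """Yield each repeated item once, at the moment its second occurrence is seen."""
    seen = set()      # every item encountered so far
    reported = set()  # duplicates that were already yielded
    for item in items:
        if item in seen:
            if item not in reported:
                reported.add(item)
                yield item
        else:
            seen.add(item)

# Stand-in for a file with one gene name per line (shortened sample)
gene_file = io.StringIO(
    "RGN\nSLC35A2\nSLC35A2\nSLC35A2\nMAGIX\nXAGE1B\nXAGE1B\nBTK\nBTK\nCOL4A6\nCOL4A6\nCOL4A6\n"
)
names = (line.strip() for line in gene_file if line.strip())
streamed_duplicates = list(iter_duplicates(names))
print(streamed_duplicates)  # ['SLC35A2', 'XAGE1B', 'BTK', 'COL4A6']
```

Unlike list(results), the generator yields duplicates in a deterministic order (the order in which each second occurrence is met), at the cost of a second set.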















edited Nov 12 at 1:48
answered Nov 10 at 20:43 by Jai



























                       
