Merging data based on matching first column in Python










12















I currently have two sets of data files that look like this:



File 1:



test1 ba ab cd dh gf
test2 fa ab cd dh gf
test3 rt ty er wq ee
test4 er rt sf sd sa


and in file 2:



test1 123 344 123
test1 234 567 787
test1 221 344 566
test3 456 121 677


I would like to combine the files based on mathching rows in the first column (so that "tests" match up)



like so:



test1 ba ab cd dh gf 123 344 123
test1 ba ab cd dh gf 234 567 787
test1 ba ab cd dh gf 221 344 566
test3 rt ty er wq ee 456 121 677


I have this Code



def combineFiles(file1,file2,outfile):
def read_file(file):
data =
for line in csv.reader(file):
data[line[0]] = line[1:]
return data
with open(file1, 'r') as f1, open(file2, 'r') as f2:
data1 = read_file(f1)
data2 = read_file(f2)
with open(outfile, 'w') as out:
wtr= csv.writer(out)
for key in data1.keys():
try:
wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
except KeyError:
pass


However the output ends up looking like this:



test1 ba ab cd dh gf 123 344 123
test3 er rt sf sd sa 456 121 677


Can anyone help me with how to make the output so that test1 can be printed all three times?



Much Appreciated










share|improve this question


























    12















    I currently have two sets of data files that look like this:



    File 1:



    test1 ba ab cd dh gf
    test2 fa ab cd dh gf
    test3 rt ty er wq ee
    test4 er rt sf sd sa


    and in file 2:



    test1 123 344 123
    test1 234 567 787
    test1 221 344 566
    test3 456 121 677


    I would like to combine the files based on mathching rows in the first column (so that "tests" match up)



    like so:



    test1 ba ab cd dh gf 123 344 123
    test1 ba ab cd dh gf 234 567 787
    test1 ba ab cd dh gf 221 344 566
    test3 rt ty er wq ee 456 121 677


    I have this Code



    def combineFiles(file1,file2,outfile):
    def read_file(file):
    data =
    for line in csv.reader(file):
    data[line[0]] = line[1:]
    return data
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
    data1 = read_file(f1)
    data2 = read_file(f2)
    with open(outfile, 'w') as out:
    wtr= csv.writer(out)
    for key in data1.keys():
    try:
    wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
    except KeyError:
    pass


    However the output ends up looking like this:



    test1 ba ab cd dh gf 123 344 123
    test3 er rt sf sd sa 456 121 677


    Can anyone help me with how to make the output so that test1 can be printed all three times?



    Much Appreciated










    share|improve this question
























      12












      12








      12


      1






      I currently have two sets of data files that look like this:



      File 1:



      test1 ba ab cd dh gf
      test2 fa ab cd dh gf
      test3 rt ty er wq ee
      test4 er rt sf sd sa


      and in file 2:



      test1 123 344 123
      test1 234 567 787
      test1 221 344 566
      test3 456 121 677


      I would like to combine the files based on mathching rows in the first column (so that "tests" match up)



      like so:



      test1 ba ab cd dh gf 123 344 123
      test1 ba ab cd dh gf 234 567 787
      test1 ba ab cd dh gf 221 344 566
      test3 rt ty er wq ee 456 121 677


      I have this Code



      def combineFiles(file1,file2,outfile):
      def read_file(file):
      data =
      for line in csv.reader(file):
      data[line[0]] = line[1:]
      return data
      with open(file1, 'r') as f1, open(file2, 'r') as f2:
      data1 = read_file(f1)
      data2 = read_file(f2)
      with open(outfile, 'w') as out:
      wtr= csv.writer(out)
      for key in data1.keys():
      try:
      wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
      except KeyError:
      pass


      However the output ends up looking like this:



      test1 ba ab cd dh gf 123 344 123
      test3 er rt sf sd sa 456 121 677


      Can anyone help me with how to make the output so that test1 can be printed all three times?



      Much Appreciated










      share|improve this question














      I currently have two sets of data files that look like this:



      File 1:



      test1 ba ab cd dh gf
      test2 fa ab cd dh gf
      test3 rt ty er wq ee
      test4 er rt sf sd sa


      and in file 2:



      test1 123 344 123
      test1 234 567 787
      test1 221 344 566
      test3 456 121 677


      I would like to combine the files based on mathching rows in the first column (so that "tests" match up)



      like so:



      test1 ba ab cd dh gf 123 344 123
      test1 ba ab cd dh gf 234 567 787
      test1 ba ab cd dh gf 221 344 566
      test3 rt ty er wq ee 456 121 677


      I have this Code



      def combineFiles(file1,file2,outfile):
      def read_file(file):
      data =
      for line in csv.reader(file):
      data[line[0]] = line[1:]
      return data
      with open(file1, 'r') as f1, open(file2, 'r') as f2:
      data1 = read_file(f1)
      data2 = read_file(f2)
      with open(outfile, 'w') as out:
      wtr= csv.writer(out)
      for key in data1.keys():
      try:
      wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
      except KeyError:
      pass


      However the output ends up looking like this:



      test1 ba ab cd dh gf 123 344 123
      test3 er rt sf sd sa 456 121 677


      Can anyone help me with how to make the output so that test1 can be printed all three times?



      Much Appreciated







      python python-3.x python-2.7






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 14 '18 at 2:41









      Rk_23Rk_23

      10117




      10117






















          4 Answers
          4






          active

          oldest

          votes


















          4














          While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.



          Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:



          with open(file1, 'r') as f1, open(file2, 'r') as f2:
          data1 = read_file(f1)
          with open(outfile, 'w') as out:
          wtr = csv.writer(out, delimiter=' ')
          for line in csv.reader(f2, delimiter=' '):
          # only write if there is a corresponding line in file1
          if line[0] in data1:
          # as you write, get the corresponding file1 data
          wtr.writerow(line[0:] + data1[line[0]] + line[1:])





          share|improve this answer
































            7














            You might want to give the Pandas library a try; it makes cases like this easy:



            >>> import pandas as pd
            >>> pd.merge(df1, df2, on='testnum', how='inner')
            testnum 1_x 2_x 3_x 4 5 1_y 2_y 3_y
            0 test1 ba ab cd dh gf 123 344 123
            1 test1 ba ab cd dh gf 234 567 787
            2 test1 ba ab cd dh gf 221 344 566
            3 test3 rt ty er wq ee 456 121 677



            This assumes the test column is named "testnum".



            >>> df1
            testnum 1 2 3 4 5
            0 test1 ba ab cd dh gf
            1 test2 fa ab cd dh gf
            2 test3 rt ty er wq ee
            3 test4 er rt sf sd sa

            >>> df2
            testnum 1 2 3
            0 test1 123 344 123
            1 test1 234 567 787
            2 test1 221 344 566
            3 test3 456 121 677


            You'd read these in with pd.read_csv().






            share|improve this answer
































              2














              The issue is that you are overwriting the keys in the line



              data[line[0]] = line[1:]


              Since your files have non-unique "keys," you could try manually making them unique using enumerate:



              for ind, line in enumerate(csv.reader(file)):
              unique_key = ''.join([line[0], "_", str(ind)])
              data[unique_key] = line[1:]


              Later, when you merge the result, you can strip the keys to remove anything after the underscore:



              wtr.writerow(((key.split("_")[0], ','.join(data1[key]), ','.join(data2[key])))


              To my taste, this is all very clumsy. If your goal is to read, manipulate, and write data to and from csv files, I would recommend looking into pandas, as this code can be written in a few lines using DataFrames (see the answer by Brad Solomon).






              share|improve this answer
































                2














                You can try collecting your items into seperate collections.defaultdict(), then get the cartesian product of the intersecting rows with itertools.product():



                from collections import defaultdict
                from itertools import product

                def collect_rows(file):
                collection = defaultdict(list)

                for line in file:
                col1, *rest = line.split()
                collection[col1].append(rest)

                return collection

                with open("file1.txt") as f1, open("file2.txt") as f2, open("output.txt", "w") as out:
                f1_collection = collect_rows(f1)
                f2_collection = collect_rows(f2)

                # Ordered intersection, no need to sort
                set_2 = set(f2_collection)
                intersection = [key for key in f1_collection if key in set_2]

                for key in intersection:
                for x, y in product(f1_collection[key], f2_collection[key]):
                out.write("%sn" % " ".join([key] + x + y))


                Which gives the following output.txt:



                test1 ba ab cd dh gf 123 344 123
                test1 ba ab cd dh gf 234 567 787
                test1 ba ab cd dh gf 221 344 566
                test3 rt ty er wq ee 456 121 677


                Note: It's probably easier to follow Brad Solomon's Pandas approach, since it can be done with one command.






                share|improve this answer
























                  Your Answer






                  StackExchange.ifUsing("editor", function ()
                  StackExchange.using("externalEditor", function ()
                  StackExchange.using("snippets", function ()
                  StackExchange.snippets.init();
                  );
                  );
                  , "code-snippets");

                  StackExchange.ready(function()
                  var channelOptions =
                  tags: "".split(" "),
                  id: "1"
                  ;
                  initTagRenderer("".split(" "), "".split(" "), channelOptions);

                  StackExchange.using("externalEditor", function()
                  // Have to fire editor after snippets, if snippets enabled
                  if (StackExchange.settings.snippets.snippetsEnabled)
                  StackExchange.using("snippets", function()
                  createEditor();
                  );

                  else
                  createEditor();

                  );

                  function createEditor()
                  StackExchange.prepareEditor(
                  heartbeatType: 'answer',
                  autoActivateHeartbeat: false,
                  convertImagesToLinks: true,
                  noModals: true,
                  showLowRepImageUploadWarning: true,
                  reputationToPostImages: 10,
                  bindNavPrevention: true,
                  postfix: "",
                  imageUploader:
                  brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                  contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                  allowUrls: true
                  ,
                  onDemand: true,
                  discardSelector: ".discard-answer"
                  ,immediatelyShowMarkdownHelp:true
                  );



                  );













                  draft saved

                  draft discarded


















                  StackExchange.ready(
                  function ()
                  StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53292432%2fmerging-data-based-on-matching-first-column-in-python%23new-answer', 'question_page');

                  );

                  Post as a guest















                  Required, but never shown

























                  4 Answers
                  4






                  active

                  oldest

                  votes








                  4 Answers
                  4






                  active

                  oldest

                  votes









                  active

                  oldest

                  votes






                  active

                  oldest

                  votes









                  4














                  While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.



                  Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:



                  with open(file1, 'r') as f1, open(file2, 'r') as f2:
                  data1 = read_file(f1)
                  with open(outfile, 'w') as out:
                  wtr = csv.writer(out, delimiter=' ')
                  for line in csv.reader(f2, delimiter=' '):
                  # only write if there is a corresponding line in file1
                  if line[0] in data1:
                  # as you write, get the corresponding file1 data
                  wtr.writerow(line[0:] + data1[line[0]] + line[1:])





                  share|improve this answer





























                    4














                    While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.



                    Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:



                    with open(file1, 'r') as f1, open(file2, 'r') as f2:
                    data1 = read_file(f1)
                    with open(outfile, 'w') as out:
                    wtr = csv.writer(out, delimiter=' ')
                    for line in csv.reader(f2, delimiter=' '):
                    # only write if there is a corresponding line in file1
                    if line[0] in data1:
                    # as you write, get the corresponding file1 data
                    wtr.writerow(line[0:] + data1[line[0]] + line[1:])





                    share|improve this answer



























                      4












                      4








                      4







                      While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.



                      Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:



                      with open(file1, 'r') as f1, open(file2, 'r') as f2:
                      data1 = read_file(f1)
                      with open(outfile, 'w') as out:
                      wtr = csv.writer(out, delimiter=' ')
                      for line in csv.reader(f2, delimiter=' '):
                      # only write if there is a corresponding line in file1
                      if line[0] in data1:
                      # as you write, get the corresponding file1 data
                      wtr.writerow(line[0:] + data1[line[0]] + line[1:])





                      share|improve this answer















                      While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.



                      Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:



                      with open(file1, 'r') as f1, open(file2, 'r') as f2:
                      data1 = read_file(f1)
                      with open(outfile, 'w') as out:
                      wtr = csv.writer(out, delimiter=' ')
                      for line in csv.reader(f2, delimiter=' '):
                      # only write if there is a corresponding line in file1
                      if line[0] in data1:
                      # as you write, get the corresponding file1 data
                      wtr.writerow(line[0:] + data1[line[0]] + line[1:])






                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Nov 15 '18 at 4:47

























                      answered Nov 14 '18 at 3:48









                      sliderslider

                      8,23811129




                      8,23811129























                          7














                          You might want to give the Pandas library a try; it makes cases like this easy:



                          >>> import pandas as pd
                          >>> pd.merge(df1, df2, on='testnum', how='inner')
                          testnum 1_x 2_x 3_x 4 5 1_y 2_y 3_y
                          0 test1 ba ab cd dh gf 123 344 123
                          1 test1 ba ab cd dh gf 234 567 787
                          2 test1 ba ab cd dh gf 221 344 566
                          3 test3 rt ty er wq ee 456 121 677



                          This assumes the test column is named "testnum".



                          >>> df1
                          testnum 1 2 3 4 5
                          0 test1 ba ab cd dh gf
                          1 test2 fa ab cd dh gf
                          2 test3 rt ty er wq ee
                          3 test4 er rt sf sd sa

                          >>> df2
                          testnum 1 2 3
                          0 test1 123 344 123
                          1 test1 234 567 787
                          2 test1 221 344 566
                          3 test3 456 121 677


                          You'd read these in with pd.read_csv().






                          share|improve this answer





























                            7














                            You might want to give the Pandas library a try; it makes cases like this easy:



                            >>> import pandas as pd
                            >>> pd.merge(df1, df2, on='testnum', how='inner')
                            testnum 1_x 2_x 3_x 4 5 1_y 2_y 3_y
                            0 test1 ba ab cd dh gf 123 344 123
                            1 test1 ba ab cd dh gf 234 567 787
                            2 test1 ba ab cd dh gf 221 344 566
                            3 test3 rt ty er wq ee 456 121 677



                            This assumes the test column is named "testnum".



                            >>> df1
                            testnum 1 2 3 4 5
                            0 test1 ba ab cd dh gf
                            1 test2 fa ab cd dh gf
                            2 test3 rt ty er wq ee
                            3 test4 er rt sf sd sa

                            >>> df2
                            testnum 1 2 3
                            0 test1 123 344 123
                            1 test1 234 567 787
                            2 test1 221 344 566
                            3 test3 456 121 677


                            You'd read these in with pd.read_csv().






                            share|improve this answer



























                              7












                              7








                              7







                              You might want to give the Pandas library a try; it makes cases like this easy:



                              >>> import pandas as pd
                              >>> pd.merge(df1, df2, on='testnum', how='inner')
                              testnum 1_x 2_x 3_x 4 5 1_y 2_y 3_y
                              0 test1 ba ab cd dh gf 123 344 123
                              1 test1 ba ab cd dh gf 234 567 787
                              2 test1 ba ab cd dh gf 221 344 566
                              3 test3 rt ty er wq ee 456 121 677



                              This assumes the test column is named "testnum".



                              >>> df1
                              testnum 1 2 3 4 5
                              0 test1 ba ab cd dh gf
                              1 test2 fa ab cd dh gf
                              2 test3 rt ty er wq ee
                              3 test4 er rt sf sd sa

                              >>> df2
                              testnum 1 2 3
                              0 test1 123 344 123
                              1 test1 234 567 787
                              2 test1 221 344 566
                              3 test3 456 121 677


                              You'd read these in with pd.read_csv().






                              share|improve this answer















                              You might want to give the Pandas library a try; it makes cases like this easy:



                              >>> import pandas as pd
                              >>> pd.merge(df1, df2, on='testnum', how='inner')
                              testnum 1_x 2_x 3_x 4 5 1_y 2_y 3_y
                              0 test1 ba ab cd dh gf 123 344 123
                              1 test1 ba ab cd dh gf 234 567 787
                              2 test1 ba ab cd dh gf 221 344 566
                              3 test3 rt ty er wq ee 456 121 677



                              This assumes the test column is named "testnum".



                              >>> df1
                              testnum 1 2 3 4 5
                              0 test1 ba ab cd dh gf
                              1 test2 fa ab cd dh gf
                              2 test3 rt ty er wq ee
                              3 test4 er rt sf sd sa

                              >>> df2
                              testnum 1 2 3
                              0 test1 123 344 123
                              1 test1 234 567 787
                              2 test1 221 344 566
                              3 test3 456 121 677


                              You'd read these in with pd.read_csv().







                              share|improve this answer














                              share|improve this answer



                              share|improve this answer








                              edited Nov 14 '18 at 12:48

























                              answered Nov 14 '18 at 2:49









                              Brad SolomonBrad Solomon

                              13.4k73484




                              13.4k73484





















                                  2














                                  The issue is that you are overwriting the keys in the line



                                  data[line[0]] = line[1:]


                                  Since your files have non-unique "keys," you could try manually making them unique using enumerate:



                                  for ind, line in enumerate(csv.reader(file)):
                                  unique_key = ''.join([line[0], "_", str(ind)])
                                  data[unique_key] = line[1:]


                                  Later, when you merge the result, you can strip the keys to remove anything after the underscore:



                                  wtr.writerow(((key.split("_")[0], ','.join(data1[key]), ','.join(data2[key])))


                                  To my taste, this is all very clumsy. If your goal is to read, manipulate, and write data to and from csv files, I would recommend looking into pandas, as this code can be written in a few lines using DataFrames (see the answer by Brad Solomon).






                                  share|improve this answer





























                                    2














                                    The issue is that you are overwriting the keys in the line



                                    data[line[0]] = line[1:]


                                    Since your files have non-unique "keys," you could try manually making them unique using enumerate:



                                    for ind, line in enumerate(csv.reader(file)):
                                    unique_key = ''.join([line[0], "_", str(ind)])
                                    data[unique_key] = line[1:]


                                    Later, when you merge the result, you can strip the keys to remove anything after the underscore:



                                    wtr.writerow(((key.split("_")[0], ','.join(data1[key]), ','.join(data2[key])))


                                    To my taste, this is all very clumsy. If your goal is to read, manipulate, and write data to and from csv files, I would recommend looking into pandas, as this code can be written in a few lines using DataFrames (see the answer by Brad Solomon).






                                    share|improve this answer



























                                      2












                                      2








                                      2







                                      The issue is that you are overwriting the keys in the line



                                      data[line[0]] = line[1:]


                                      Since your files have non-unique "keys," you could try manually making them unique using enumerate:



                                      for ind, line in enumerate(csv.reader(file)):
                                      unique_key = ''.join([line[0], "_", str(ind)])
                                      data[unique_key] = line[1:]


                                      Later, when you merge the result, you can strip the keys to remove anything after the underscore:



                                      wtr.writerow(((key.split("_")[0], ','.join(data1[key]), ','.join(data2[key])))


                                      To my taste, this is all very clumsy. If your goal is to read, manipulate, and write data to and from csv files, I would recommend looking into pandas, as this code can be written in a few lines using DataFrames (see the answer by Brad Solomon).






                                      share|improve this answer















                                      The issue is that you are overwriting the keys in the line



                                      data[line[0]] = line[1:]


                                      Since your files have non-unique "keys," you could try manually making them unique using enumerate:



                                      for ind, line in enumerate(csv.reader(file)):
                                      unique_key = ''.join([line[0], "_", str(ind)])
                                      data[unique_key] = line[1:]


                                      Later, when you merge the result, you can strip the keys to remove anything after the underscore:



                                      wtr.writerow(((key.split("_")[0], ','.join(data1[key]), ','.join(data2[key])))


                                      To my taste, this is all very clumsy. If your goal is to read, manipulate, and write data to and from csv files, I would recommend looking into pandas, as this code can be written in a few lines using DataFrames (see the answer by Brad Solomon).







                                      share|improve this answer














                                      share|improve this answer



                                      share|improve this answer








                                      edited Nov 14 '18 at 3:12

























                                      answered Nov 14 '18 at 2:58









                                      ppinchukppinchuk

                                      813




                                      813





















                                          2














                                          You can try collecting your items into seperate collections.defaultdict(), then get the cartesian product of the intersecting rows with itertools.product():



                                          from collections import defaultdict
                                          from itertools import product

                                          def collect_rows(file):
                                          collection = defaultdict(list)

                                          for line in file:
                                          col1, *rest = line.split()
                                          collection[col1].append(rest)

                                          return collection

                                          with open("file1.txt") as f1, open("file2.txt") as f2, open("output.txt", "w") as out:
                                          f1_collection = collect_rows(f1)
                                          f2_collection = collect_rows(f2)

                                          # Ordered intersection, no need to sort
                                          set_2 = set(f2_collection)
                                          intersection = [key for key in f1_collection if key in set_2]

                                          for key in intersection:
                                          for x, y in product(f1_collection[key], f2_collection[key]):
                                          out.write("%sn" % " ".join([key] + x + y))


                                          Which gives the following output.txt:



                                          test1 ba ab cd dh gf 123 344 123
                                          test1 ba ab cd dh gf 234 567 787
                                          test1 ba ab cd dh gf 221 344 566
                                          test3 rt ty er wq ee 456 121 677


                                          Note: It's probably easier to follow Brad Solomon's Pandas approach, since it can be done with one command.






                                          share|improve this answer





























                                            2














                                            You can try collecting your items into seperate collections.defaultdict(), then get the cartesian product of the intersecting rows with itertools.product():



                                            from collections import defaultdict
                                            from itertools import product

                                            def collect_rows(file):
                                            collection = defaultdict(list)

                                            for line in file:
                                            col1, *rest = line.split()
                                            collection[col1].append(rest)

                                            return collection

                                            with open("file1.txt") as f1, open("file2.txt") as f2, open("output.txt", "w") as out:
                                            f1_collection = collect_rows(f1)
                                            f2_collection = collect_rows(f2)

                                            # Ordered intersection, no need to sort
                                            set_2 = set(f2_collection)
                                            intersection = [key for key in f1_collection if key in set_2]

                                            for key in intersection:
                                            for x, y in product(f1_collection[key], f2_collection[key]):
                                            out.write("%sn" % " ".join([key] + x + y))


                                            Which gives the following output.txt:



                                            test1 ba ab cd dh gf 123 344 123
                                            test1 ba ab cd dh gf 234 567 787
                                            test1 ba ab cd dh gf 221 344 566
                                            test3 rt ty er wq ee 456 121 677


                                            Note: It's probably easier to follow Brad Solomon's Pandas approach, since it can be done with one command.






                                            share|improve this answer



























                                              2












                                              2








                                              2







                                              You can try collecting your items into seperate collections.defaultdict(), then get the cartesian product of the intersecting rows with itertools.product():



                                              from collections import defaultdict
                                              from itertools import product

                                              def collect_rows(file):
                                              collection = defaultdict(list)

                                              for line in file:
                                              col1, *rest = line.split()
                                              collection[col1].append(rest)

                                              return collection

                                              with open("file1.txt") as f1, open("file2.txt") as f2, open("output.txt", "w") as out:
                                              f1_collection = collect_rows(f1)
                                              f2_collection = collect_rows(f2)

                                              # Ordered intersection, no need to sort
                                              set_2 = set(f2_collection)
                                              intersection = [key for key in f1_collection if key in set_2]

                                              for key in intersection:
                                              for x, y in product(f1_collection[key], f2_collection[key]):
                                              out.write("%sn" % " ".join([key] + x + y))


                                              Which gives the following output.txt:



                                              test1 ba ab cd dh gf 123 344 123
                                              test1 ba ab cd dh gf 234 567 787
                                              test1 ba ab cd dh gf 221 344 566
                                              test3 rt ty er wq ee 456 121 677


                                              Note: It's probably easier to follow Brad Solomon's Pandas approach, since it can be done with one command.






                                              share|improve this answer















                                              You can try collecting your items into seperate collections.defaultdict(), then get the cartesian product of the intersecting rows with itertools.product():



                                              from collections import defaultdict
                                              from itertools import product

                                              def collect_rows(file):
                                              collection = defaultdict(list)

                                              for line in file:
                                              col1, *rest = line.split()
                                              collection[col1].append(rest)

                                              return collection

                                              with open("file1.txt") as f1, open("file2.txt") as f2, open("output.txt", "w") as out:
                                              f1_collection = collect_rows(f1)
                                              f2_collection = collect_rows(f2)

                                              # Ordered intersection, no need to sort
                                              set_2 = set(f2_collection)
                                              intersection = [key for key in f1_collection if key in set_2]

                                              for key in intersection:
                                              for x, y in product(f1_collection[key], f2_collection[key]):
                                              out.write("%sn" % " ".join([key] + x + y))


                                              Which gives the following output.txt:



                                              test1 ba ab cd dh gf 123 344 123
                                              test1 ba ab cd dh gf 234 567 787
                                              test1 ba ab cd dh gf 221 344 566
                                              test3 rt ty er wq ee 456 121 677


                                              Note: It's probably easier to follow Brad Solomon's Pandas approach, since it can be done with one command.







                                              share|improve this answer














                                              share|improve this answer



                                              share|improve this answer








                                              edited Nov 14 '18 at 3:37

























                                              answered Nov 14 '18 at 3:05









                                              RoadRunnerRoadRunner

                                              11.2k31340




                                              11.2k31340



























                                                  draft saved

                                                  draft discarded
















































                                                  Thanks for contributing an answer to Stack Overflow!


                                                  • Please be sure to answer the question. Provide details and share your research!

                                                  But avoid


                                                  • Asking for help, clarification, or responding to other answers.

                                                  • Making statements based on opinion; back them up with references or personal experience.

                                                  To learn more, see our tips on writing great answers.




                                                  draft saved


                                                  draft discarded














                                                  StackExchange.ready(
                                                  function ()
                                                  StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53292432%2fmerging-data-based-on-matching-first-column-in-python%23new-answer', 'question_page');

                                                  );

                                                  Post as a guest















                                                  Required, but never shown





















































                                                  Required, but never shown














                                                  Required, but never shown












                                                  Required, but never shown







                                                  Required, but never shown

































                                                  Required, but never shown














                                                  Required, but never shown












                                                  Required, but never shown







                                                  Required, but never shown







                                                  Popular posts from this blog

                                                  Top Tejano songwriter Luis Silva dead of heart attack at 64

                                                  ReactJS Fetched API data displays live - need Data displayed static

                                                  政党