How to grab internal text from matching lines in multiline text in python?










I have a text file called test.txt. From it, I want to grab the lines that start with >lcl and extract the value that follows locus_tag= up to the closing bracket ]. I want to do the same for the value that follows location=. The result I want is shown below. How can I do this in Python?



desired result



SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)


test.txt



>lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCCTACCGCAAAAACTGTTAAGCCACCCA
CTGACCCAGAGCTTGCAGCAATTCGAGTTAACAAAATTCTGCCAATTCTCCAAGATTTACAAAGTGCAGA
CCAGTCAAAGAGATCAACTGCTGCAACTGCCATTGCGAACCTCGTTGACGATACAAAATGTCGAAAGTTA
TTCTTGAGAGAGCAAATTGTTCGTATTCTACTCGAACAAACCCTTACAGACTCAAGCATGGAAACTAGAA
>lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
ATGATCTGTAATACGCTCGGTGTTCCACCCTGCAACAGAATTCTTAAGAAATTCTCCGTTGGCGAGAGTC
GTCTCGAAATTCAAGACTCAGTACGAGGCAAAGATGTCTACATCATTCAATCGGGTGGAGGAAAGGCCAA
TGATCACTTCGTGGATCTTTGCATTATGATCTCCGCATGCAAAACTGGCTCTGCCAAGCGCGTCACTGTC
GTCCTTCCTTTGTTTCCTTATTCACGACAACCTGATCTGCCATACAACAAGATTGGCGCACCACTTGCCA
>lcl|NW_001820834.1_gene_1034 [locus_tag=SS1G_02099] [db_xref=GeneID:5493612] [partial=5',3'] [location=<2692251..>2693298] [gbkey=Gene]
ATGGCTTCTGTTTACAAGTCATTATCAAAGACCTCTGGTCATAAAGAAGAAACCCCGACTGGTGTCAAGA
AAAACAAGCAAAGAGTTTTGATCTTGTCTTCAAGAGGAATAACTTACAGGTATATAAATTTGTACCGATG
CGATGCAAAAAATCGCAGGAAAATGCTAACTCTACAACTTAGACATCGACATCTCCTCAATGACCTTGCG
TCCCTACTTCCCCACGGTAGGAAAGATGCGAAACTCGATACCAAGTCAAAGCTTTATCAATTGAATGAAT
>lcl|NW_001820830.1_gene_400 [locus_tag=SS1G_05227] [db_xref=GeneID:5489764] [partial=5',3'] [location=complement(<1032740..>1033620)] [gbkey=Gene]
ATGGCGGACGGATGTAAGTTAATTGATGTTCCTACTATTCCAGACTAATATTTGTTCTCGTCCCTACAAT
GCATTCGGAACGGATGGTACTCAGTTAACTTTGTAACTAATACAACGTCTAGTAAATGACCAAAGAACTG


I am new to Python, so I tried to come up with something like this:

results = []
f = open("test.txt", 'r')

while True:
    line = f.readline()
    if not line:
        break
    file_name = line.split("locus_tag")[-1].strip()
    f.readline()  # skip line
    data_seq1 = f.readline().strip()
    f.readline()
    data_seq2 = f.readline().strip()
    results.append((file_name, data_seq1))









python






edited Nov 16 '18 at 3:53 by Mad Physicist
asked Nov 16 '18 at 2:27 by MAPK












  • You forgot to ask a question. What have you tried?
    – Lie Ryan, Nov 16 '18 at 2:33

  • @LieRyan Please see my edits.
    – MAPK, Nov 16 '18 at 2:36

















2 Answers
I think the most trivial way to solve your issue is with a regex, as in this example:

import re

results = []
# Open the file in read mode;
# the with statement takes care of closing the file
with open('YOUR_FILE_PATH', 'r') as f_file:
    # Read the entire file as one string
    data = f_file.read()
    # Search for lines that begin with '>lcl' and contain
    # both [locus_tag=...] and [location=...]
    results = re.findall(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', data)

for locus, location in results:
    print(locus, location)


Output:

SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)
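
A note on the pattern, as a minimal sketch: square brackets are special in regular expressions (they open a character class), so the literal [ and ] around each tag need a backslash to be matched as text. The snippet below checks such a pattern against one header line copied from the question's test.txt. The names `header`, `pattern` and `matches` are only illustrative.

```python
import re

# One header line copied from the sample test.txt in the question
header = (">lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] "
          "[db_xref=GeneID:5483157] [partial=5',3'] "
          "[location=complement(<502136..>503461)] [gbkey=Gene]")

# \[ and \] match literal brackets; (.*?) lazily captures up to the next ]
pattern = re.compile(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]')

# findall returns one (locus_tag, location) tuple per matching line
matches = pattern.findall(header)
print(matches)
```

Because the capture groups are lazy, each one stops at the first closing bracket, which is why the parentheses inside complement(...) come through intact.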


Another variation that stores the results in a dict and processes the file line by line:

import re

results = {}
with open('fichier1', 'r') as f_file:
    # Split the file's contents into a list of lines
    data = f_file.readlines()
    for line in data:
        # Search for lines that begin with '>lcl',
        # using the same pattern as the first attempt
        results.update(re.findall(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', line))

for locus, location in results.items():
    print(locus, location)
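
One caveat with a dict: keys are unique, so if two header lines share a locus tag, only the last location is kept. A hedged alternative, assuming you would rather keep every record, collects tuples in a list while reading line by line. The function name `extract_pairs` is hypothetical, and the sample lines are made up (not from test.txt) to show the duplicate case.

```python
import io
import re

PATTERN = re.compile(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]')

def extract_pairs(lines):
    """Return (locus_tag, location) tuples for every matching header line."""
    pairs = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            pairs.append(match.groups())
    return pairs

# Stand-in for an open file: any iterable of lines works.
# Made-up data in which both headers share the same locus tag.
sample = io.StringIO(
    ">lcl|a [locus_tag=X1] [location=<1..>5] [gbkey=Gene]\n"
    "ATGGGC\n"
    ">lcl|b [locus_tag=X1] [location=complement(<7..>9)] [gbkey=Gene]\n"
)
pairs = extract_pairs(sample)
print(pairs)
```

Passing an open file object instead of the StringIO reads the file lazily, which avoids loading a large sequence file into memory at once.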


Edit: Creating a DataFrame and exporting it to a CSV file:

import re
from pandas import DataFrame as df

results = {}
with open('fichier1', 'r') as f_file:
    data = f_file.readlines()
    for line in data:
        results.update(re.findall(
            r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]',
            line
        ))

# Build a two-column DataFrame with a 1-based index
df_ = df(
    list(results.items()),
    index=range(1, len(results) + 1),
    columns=['locus', 'location']
)
print(df_)
df_.to_csv('results.csv', sep=',')


This prints the DataFrame and creates a file called results.csv:

        locus                        location
1  SS1G_12233    complement(<502136..>503461)
2  SS1G_08319                <504653..>506706
3  SS1G_05227  complement(<1032740..>1033620)
4  SS1G_02099              <2692251..>2693298
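
If pandas is not available, the standard library's csv module can write the same two columns. This is only a sketch, assuming `results` is a dict mapping locus to location like the one built by the loop above. The two sample entries are copied from the desired output in the question, and the code writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

# Assumed input: a locus -> location dict like the one built above
results = {
    'SS1G_08319': '<504653..>506706',
    'SS1G_12233': 'complement(<502136..>503461)',
}

# Swap the buffer for open('results.csv', 'w', newline='') to write a real file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['locus', 'location'])   # header row
writer.writerows(results.items())        # one row per (locus, location) pair

csv_text = buffer.getvalue()
print(csv_text)
```

Unlike the DataFrame export, this version writes no index column, so the file contains just the header and the two fields.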






edited Nov 16 '18 at 3:21
answered Nov 16 '18 at 2:40 by Chiheb Nexus












  • Can't argue with those results.
    – Mad Physicist, Nov 16 '18 at 2:45

  • @MAPK. Python has a really good regex tutorial as part of the official docs: docs.python.org/3/howto/regex.html. You will find that this answer is quite simple after reading through it.
    – Mad Physicist, Nov 16 '18 at 2:46

  • @MadPhysicist Thanks for your comment. Can you explain why you don't argue with the results?
    – Chiheb Nexus, Nov 16 '18 at 2:48

  • @MAPK See my last edit.
    – Chiheb Nexus, Nov 16 '18 at 3:21

  • No shame whatsoever, I assure you. You're doing great as it is, and I'm happy to have introduced you to a new expression. I am guessing that French and Python are the first two languages :)
    – Mad Physicist, Nov 16 '18 at 3:50


















    I would like to present two alternative solutions. One that will extract any set of named tags on your line using regular expressions, and another which is a complete travesty but shows a way to do it without regular expressions.



    Generic Regex Solution



    import re

    def get_tags(filename, tags, prefix='>lcl'):
        tags = set(tags)
        # Brackets are escaped so the pattern matches literal "[key=value]" pairs
        pattern = re.compile(r'\[(.+?)=(.+?)\]')

        def parse_line(line):
            # Keep only the requested tags found on this line
            return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}

        with open(filename) as f:
            return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]


    This function returns a list of dictionaries keyed by the tags you are interested in. You would use it like this:



    tags = ['locus_tag', 'location']
    result = get_tags('test.txt', tags)


    You could use the result to get the exact printout you want:



    for line in get_tags('test.txt', tags):
        print(*(line[tag] for tag in tags))


    This has the advantage that you can use the results however you choose later, and that you can configure which tags are extracted.
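
    To see the pattern at work without reading a file, here is a minimal sketch that applies the same regular expression to one header line copied from the question. The helper name parse_header is hypothetical and only used for this illustration; it mirrors the inner parse_line function above.

```python
import re

# Same pattern as in get_tags: literal "[key=value]" pairs, brackets escaped.
PATTERN = re.compile(r'\[(.+?)=(.+?)\]')

def parse_header(line, tags):
    """Return {tag: value} for the requested tags found in one header line."""
    wanted = set(tags)
    return {m.group(1): m.group(2)
            for m in PATTERN.finditer(line)
            if m.group(1) in wanted}

# One header line taken from the question's test.txt
header = (">lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] "
          "[db_xref=GeneID:5483157] [partial=5',3'] "
          "[location=complement(<502136..>503461)] [gbkey=Gene]")

parsed = parse_header(header, ['locus_tag', 'location'])
print(parsed['locus_tag'], parsed['location'])
# prints: SS1G_12233 complement(<502136..>503461)
```

    Because both groups are lazy, each key stops at the first = and each value stops at the first ], which is why the parentheses inside the location value cause no trouble.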



    No Regex Solution



    This version is just something I wrote to show that it is possible. Please do not emulate it, as the code is a pointless maintenance burden.



    def get_tags2(filename, tags, prefix='>lcl'):
        tags = set(tags)

        def parse_line(line):
            # Split on '[' to isolate each "key=value]" chunk, then cut at ']' and '='
            items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
            return dict(tag for tag in items if tag[0] in tags)

        with open(filename) as f:
            # Iterate over the open file handle f
            return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]


    This function behaves just like the first one but the parsing function is a hot mess by comparison. It's also much less robust, e.g. because it's assumed that all your square brackets are more or less matching.
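
    As a rough illustration of the robustness gap, the sketch below feeds both parsing strategies a made-up header whose note value contains an extra = sign. The header line and the note tag are assumptions invented for this example; they do not come from the question's data. This is a different failure mode from the mismatched brackets mentioned above, but it shows the same fragility.

```python
import re

# Made-up header line: the "note" value itself contains '='.
line = ">lcl|example [locus_tag=SS1G_00000] [note=a=b]"
tags = {'locus_tag', 'note'}

# Regex approach: the lazy groups stop at the first '=' and the first ']'.
pattern = re.compile(r'\[(.+?)=(.+?)\]')
by_regex = {m.group(1): m.group(2)
            for m in pattern.finditer(line)
            if m.group(1) in tags}
print(by_regex)  # {'locus_tag': 'SS1G_00000', 'note': 'a=b'}

# Split approach: 'note=a=b' splits into three pieces, which dict() rejects.
items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
try:
    by_split = dict(item for item in items if item[0] in tags)
except ValueError as exc:
    by_split = None
    print('split-based parser failed:', exc)
```

    Passing maxsplit=1 to split('=') would patch this particular case, but each such patch adds to the maintenance burden that the regex version avoids.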



    Here is an IDEOne link showing off both methods: https://ideone.com/X2LKqL






        edited Nov 16 '18 at 4:06

























        answered Nov 16 '18 at 3:26









        Mad Physicist



























