How to grab internal text from matching lines in multiline text in Python?

I have a text file called test.txt. From test.txt, I want to take the lines that start with >lcl and extract the value after locus_tag=, up to the closing bracket ]. I want to do the same for the value after location=. The result I want is shown below. How can I do this in Python?

desired result

SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)

test.txt

>lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCCTACCGCAAAAACTGTTAAGCCACCCA
CTGACCCAGAGCTTGCAGCAATTCGAGTTAACAAAATTCTGCCAATTCTCCAAGATTTACAAAGTGCAGA
CCAGTCAAAGAGATCAACTGCTGCAACTGCCATTGCGAACCTCGTTGACGATACAAAATGTCGAAAGTTA
TTCTTGAGAGAGCAAATTGTTCGTATTCTACTCGAACAAACCCTTACAGACTCAAGCATGGAAACTAGAA
>lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
ATGATCTGTAATACGCTCGGTGTTCCACCCTGCAACAGAATTCTTAAGAAATTCTCCGTTGGCGAGAGTC
GTCTCGAAATTCAAGACTCAGTACGAGGCAAAGATGTCTACATCATTCAATCGGGTGGAGGAAAGGCCAA
TGATCACTTCGTGGATCTTTGCATTATGATCTCCGCATGCAAAACTGGCTCTGCCAAGCGCGTCACTGTC
GTCCTTCCTTTGTTTCCTTATTCACGACAACCTGATCTGCCATACAACAAGATTGGCGCACCACTTGCCA
>lcl|NW_001820834.1_gene_1034 [locus_tag=SS1G_02099] [db_xref=GeneID:5493612] [partial=5',3'] [location=<2692251..>2693298] [gbkey=Gene]
ATGGCTTCTGTTTACAAGTCATTATCAAAGACCTCTGGTCATAAAGAAGAAACCCCGACTGGTGTCAAGA
AAAACAAGCAAAGAGTTTTGATCTTGTCTTCAAGAGGAATAACTTACAGGTATATAAATTTGTACCGATG
CGATGCAAAAAATCGCAGGAAAATGCTAACTCTACAACTTAGACATCGACATCTCCTCAATGACCTTGCG
TCCCTACTTCCCCACGGTAGGAAAGATGCGAAACTCGATACCAAGTCAAAGCTTTATCAATTGAATGAAT
>lcl|NW_001820830.1_gene_400 [locus_tag=SS1G_05227] [db_xref=GeneID:5489764] [partial=5',3'] [location=complement(<1032740..>1033620)] [gbkey=Gene]
ATGGCGGACGGATGTAAGTTAATTGATGTTCCTACTATTCCAGACTAATATTTGTTCTCGTCCCTACAAT
GCATTCGGAACGGATGGTACTCAGTTAACTTTGTAACTAATACAACGTCTAGTAAATGACCAAAGAACTG

I am new to Python, so I tried to come up with something like this:

results = []
f = open("test.txt", 'r')

while True:
    line = f.readline()
    if not line:
        break
    file_name = line.split("locus_tag")[-1].strip()
    f.readline()  # skip line
    data_seq1 = f.readline().strip()
    f.readline()
    data_seq2 = f.readline().strip()
    results.append((file_name, data_seq1))

edited Nov 16 '18 at 3:53

Mad Physicist

38.1k1678110

asked Nov 16 '18 at 2:27

MAPK

1,797835

You forgot to ask a question. What have you tried?

– Lie Ryan
Nov 16 '18 at 2:33

1

@LieRyan Please see my edits.

– MAPK
Nov 16 '18 at 2:36

2 Answers

I think the most trivial way to solve your issue is to use a regex, as in this example:

import re

results = []
# Open the file in 'read' mode;
# the with statement takes care of closing the file
with open('YOUR_FILE_PATH', 'r') as f_file:
    # Read the entire file as one string
    data = f_file.read()
    # Search for the lines that begin with '>lcl'
    # and contain [locus_tag=...] and [location=...]
    results = re.findall(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', data)

for locus, location in results:
    print(locus, location)

Output:

SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)
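
To see why the square brackets are escaped, here is a minimal sketch that runs the pattern on a single header line from test.txt held in a string, so no file is needed. The names `sample`, `pattern` and `matches` are only for illustration. Without the backslashes, `[locus_tag=(.*?)]` would be read as a character class rather than as literal brackets.

```python
import re

# One header line copied from test.txt, followed by a shortened sequence line.
sample = (
    ">lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] "
    "[db_xref=GeneID:5486863] [partial=5',3'] "
    "[location=<504653..>506706] [gbkey=Gene]\n"
    "ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCC\n"
)

# \[ and \] match literal brackets; (.*?) lazily captures up to the next ].
# '.' does not match a newline by default, so each match stays on one line.
pattern = re.compile(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]')

matches = pattern.findall(sample)
print(matches)
```

With two capture groups, `re.findall` returns a list of 2-tuples, which is why the loop above can unpack each item into `locus` and `location`.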

Another variation, using a dict for the result and processing the file line by line:

import re

results = {}
with open('fichier1', 'r') as f_file:
    # Read the file's lines into a list
    data = f_file.readlines()
    for line in data:
        # Search for the lines that begin with '>lcl',
        # same as in the first attempt
        results.update(re.findall(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', line))

for locus, location in results.items():
    print(locus, location)
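
One thing to keep in mind with the dict variation: if the same locus tag appeared on two header lines, the later location would overwrite the earlier one. The sketch below keeps every pair in a list instead. It assumes a helper called `extract_pairs` (a name made up here) and uses `io.StringIO` with invented header lines as a stand-in for the real file.

```python
import io
import re

HEADER = re.compile(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]')

def extract_pairs(lines):
    """Return a (locus_tag, location) tuple for every matching header line."""
    pairs = []
    for line in lines:
        m = HEADER.match(line)
        if m:
            pairs.append(m.groups())
    return pairs

# Stand-in for open('test.txt'). These header lines are made up for the
# example; the second one repeats the first locus tag on purpose.
fake_file = io.StringIO(
    ">lcl|A [locus_tag=SS1G_08319] [location=<504653..>506706] [gbkey=Gene]\n"
    "ATGGGCAAAG\n"
    ">lcl|B [locus_tag=SS1G_08319] [location=complement(<1..>9)] [gbkey=Gene]\n"
)

pairs = extract_pairs(fake_file)
print(pairs)             # both entries are kept
print(len(dict(pairs)))  # a dict would collapse them into one key
```

Passing an open file object instead of `fake_file` should work the same way, since both yield one line at a time.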

Edit: Creating a DataFrame and exporting it to a CSV file:

import re
from pandas import DataFrame as df

results = {}
with open('fichier1', 'r') as f_file:
    data = f_file.readlines()
    for line in data:
        results.update(re.findall(
            r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]',
            line
        ))

# Build a two-column frame with a 1-based index
df_ = df(
    list(results.items()),
    index=range(1, len(results) + 1),
    columns=['locus', 'location']
)
print(df_)
df_.to_csv('results.csv', sep=',')

This prints the DataFrame and creates a file called results.csv:

        locus                        location
1  SS1G_12233    complement(<502136..>503461)
2  SS1G_08319                <504653..>506706
3  SS1G_05227  complement(<1032740..>1033620)
4  SS1G_02099              <2692251..>2693298
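
If pandas is not available, the standard library's csv module can write the same two columns. This is only a sketch under assumptions: `write_pairs` is a hypothetical helper, the pairs are hard-coded rather than read from the file, and the output goes to an in-memory buffer so the result can be inspected.

```python
import csv
import io

# (locus, location) pairs like those found by the regex; hard-coded here.
pairs = [
    ('SS1G_08319', '<504653..>506706'),
    ('SS1G_12233', 'complement(<502136..>503461)'),
]

def write_pairs(pairs, handle):
    """Write a header row followed by one row per (locus, location) pair."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['locus', 'location'])
    writer.writerows(pairs)

buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

To write a real file, the same function can be called with `open('results.csv', 'w', newline='')` in a with statement instead of the buffer.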

edited Nov 16 '18 at 3:21

answered Nov 16 '18 at 2:40

Chiheb Nexus

5,30631829

Can't argue with those results.

– Mad Physicist
Nov 16 '18 at 2:45

3

@MAPK. Python has a really good regex tutorial as part of the official docs: docs.python.org/3/howto/regex.html. You will find that this answer is quite simple after reading through it.

– Mad Physicist
Nov 16 '18 at 2:46

@MadPhysicist Thanks for your comment. Can you explain to me why you don't argue with the results?

– Chiheb Nexus
Nov 16 '18 at 2:48

1

@MAPK See my last edit.

– Chiheb Nexus
Nov 16 '18 at 3:21

1

No shame whatsoever, I assure you. You're doing great as it is, and I'm happy to have introduced you to a new expression. I am guessing that French and Python are the first two languages :)

– Mad Physicist
Nov 16 '18 at 3:50

I would like to present two alternative solutions: one that extracts any set of named tags from your line using regular expressions, and another that is a complete travesty but shows a way to do it without regular expressions.

Generic Regex Solution

import re

def get_tags(filename, tags, prefix='>lcl'):
    tags = set(tags)
    pattern = re.compile(r'\[(.+?)=(.+?)\]')

    def parse_line(line):
        # Map each requested tag name to its value on this line
        return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}

    with open(filename) as f:
        # Keep only lines with the given prefix (all lines if prefix is None)
        return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]

This function returns a list of dictionaries keyed by the tags you are interested in. You would use it like this:

tags = ['locus_tag', 'location']
result = get_tags('test.txt', tags)

You could use the result to get the exact printout you want:

for line in get_tags('test.txt', tags):
    print(*(line[tag] for tag in tags))

This has the advantage that you can use the results as you choose later, and configure which tags you extract.
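
To make the shape of one result entry concrete, here is a small sketch that applies the same pattern to a single header line from test.txt, without opening a file. The standalone `parse_line(line, tags)` signature and the tag name `protein_id` are assumptions for the example. It also shows that a requested tag missing from the line is simply absent from the dictionary, so `dict.get` with a default may be safer than indexing.

```python
import re

TAG = re.compile(r'\[(.+?)=(.+?)\]')

def parse_line(line, tags):
    """Return {tag: value} for the requested tags found in one header line."""
    return {m.group(1): m.group(2) for m in TAG.finditer(line) if m.group(1) in tags}

# Header line copied from test.txt.
line = (">lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] "
        "[db_xref=GeneID:5483157] [partial=5',3'] "
        "[location=complement(<502136..>503461)] [gbkey=Gene]")

parsed = parse_line(line, {'locus_tag', 'location'})
print(parsed)

# 'protein_id' does not occur on this line, so it is missing from the dict;
# .get() supplies a placeholder instead of raising KeyError.
partial = parse_line(line, {'locus_tag', 'protein_id'})
print(partial.get('protein_id', '-'))
```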

No Regex Solution

This version is just something I wrote to show that it is possible. Please do not emulate it, as the code is a pointless maintenance burden.

def get_tags2(filename, tags, prefix='>lcl'):
    tags = set(tags)

    def parse_line(line):
        # Cut each "[name=value]" chunk at ']' and split it on the first '='
        items = [tag.split(']')[0].split('=', 1) for tag in line.split('[')[1:]]
        return dict(tag for tag in items if tag[0] in tags)

    with open(filename) as f:
        return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]

This function behaves just like the first one, but the parsing function is a hot mess by comparison. It's also much less robust, e.g. because it assumes that all your square brackets more or less match.
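
As one illustration of how fragile plain splitting can be, consider a value that itself contains an `=`. The sketch below uses a made-up header with a `[note=a=b]` tag (not from test.txt) and two standalone helpers, `parse_split` and `parse_regex`, written only for this comparison. Splitting on every `=` yields three pieces for that tag, which `dict()` rejects, while the regex keeps everything after the first `=` as the value.

```python
import re

def parse_split(line, tags):
    # Split-based parsing that splits on every '=' in a tag.
    items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
    return dict(tag for tag in items if tag[0] in tags)

def parse_regex(line, tags):
    # Regex-based parsing: the name ends at the first '=', the value at ']'.
    pattern = re.compile(r'\[(.+?)=(.+?)\]')
    return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}

# Hypothetical header whose second value contains '='.
line = ">lcl|example [locus_tag=SS1G_08319] [note=a=b]"
tags = {'locus_tag', 'note'}

regex_result = parse_regex(line, tags)
print(regex_result)

try:
    split_result = parse_split(line, tags)
except ValueError as exc:
    # dict() needs 2-item sequences; ['note', 'a', 'b'] has three items.
    split_result = None
    print('split-based parser failed:', exc)
```

Limiting the split to the first `=` with `split('=', 1)` avoids this particular failure, though unbalanced brackets would still trip the split-based approach.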

Here is an IDEOne link showing off both methods: https://ideone.com/X2LKqL

edited Nov 16 '18 at 4:06

answered Nov 16 '18 at 3:26

Mad Physicist

38.1k1678110
