How to grab internal text from matching lines in multiline text in python?

I have a text file called test.txt. From test.txt, I want to grab the lines that start with >lcl, then extract the value after locus_tag= up to the closing bracket ]. I want to do the same thing for the value after location=. The result I want is shown below. How can I do this in Python?
Desired result:
SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)
test.txt
>lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] [partial=5',3'] [location=<504653..>506706] [gbkey=Gene]
ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCCTACCGCAAAAACTGTTAAGCCACCCA
CTGACCCAGAGCTTGCAGCAATTCGAGTTAACAAAATTCTGCCAATTCTCCAAGATTTACAAAGTGCAGA
CCAGTCAAAGAGATCAACTGCTGCAACTGCCATTGCGAACCTCGTTGACGATACAAAATGTCGAAAGTTA
TTCTTGAGAGAGCAAATTGTTCGTATTCTACTCGAACAAACCCTTACAGACTCAAGCATGGAAACTAGAA
>lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] [db_xref=GeneID:5483157] [partial=5',3'] [location=complement(<502136..>503461)] [gbkey=Gene]
ATGATCTGTAATACGCTCGGTGTTCCACCCTGCAACAGAATTCTTAAGAAATTCTCCGTTGGCGAGAGTC
GTCTCGAAATTCAAGACTCAGTACGAGGCAAAGATGTCTACATCATTCAATCGGGTGGAGGAAAGGCCAA
TGATCACTTCGTGGATCTTTGCATTATGATCTCCGCATGCAAAACTGGCTCTGCCAAGCGCGTCACTGTC
GTCCTTCCTTTGTTTCCTTATTCACGACAACCTGATCTGCCATACAACAAGATTGGCGCACCACTTGCCA
>lcl|NW_001820834.1_gene_1034 [locus_tag=SS1G_02099] [db_xref=GeneID:5493612] [partial=5',3'] [location=<2692251..>2693298] [gbkey=Gene]
ATGGCTTCTGTTTACAAGTCATTATCAAAGACCTCTGGTCATAAAGAAGAAACCCCGACTGGTGTCAAGA
AAAACAAGCAAAGAGTTTTGATCTTGTCTTCAAGAGGAATAACTTACAGGTATATAAATTTGTACCGATG
CGATGCAAAAAATCGCAGGAAAATGCTAACTCTACAACTTAGACATCGACATCTCCTCAATGACCTTGCG
TCCCTACTTCCCCACGGTAGGAAAGATGCGAAACTCGATACCAAGTCAAAGCTTTATCAATTGAATGAAT
>lcl|NW_001820830.1_gene_400 [locus_tag=SS1G_05227] [db_xref=GeneID:5489764] [partial=5',3'] [location=complement(<1032740..>1033620)] [gbkey=Gene]
ATGGCGGACGGATGTAAGTTAATTGATGTTCCTACTATTCCAGACTAATATTTGTTCTCGTCCCTACAAT
GCATTCGGAACGGATGGTACTCAGTTAACTTTGTAACTAATACAACGTCTAGTAAATGACCAAAGAACTG
I am new to Python, so I tried to come up with something like this:
results = []
f = open("test.txt", 'r')
while True:
    line = f.readline()
    if not line:
        break
    file_name = line.split("locus_tag")[-1].strip()
    f.readline()  # skip line
    data_seq1 = f.readline().strip()
    f.readline()
    data_seq2 = f.readline().strip()
    results.append((file_name, data_seq1))
python
edited Nov 16 '18 at 3:53 – Mad Physicist
asked Nov 16 '18 at 2:27 – MAPK
You forgot to ask a question. What have you tried?
– Lie Ryan
Nov 16 '18 at 2:33
1
@LieRyan Please see my edits.
– MAPK
Nov 16 '18 at 2:36
2 Answers
I think the simplest way to solve your issue is by using a regex, like this example:
import re
results = []
# Open the file in read mode;
# the with statement takes care of closing the file
with open('YOUR_FILE_PATH', 'r') as f_file:
    # Read the entire file as one string
    data = f_file.read()
    # Search for the strings that begin with '>lcl'
    # and capture the [locus_tag=...] and [location=...] values
    results = re.findall(r'>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', data)
for locus, location in results:
    print(locus, location)
SS1G_08319 <504653..>506706
SS1G_12233 complement(<502136..>503461)
SS1G_02099 <2692251..>2693298
SS1G_05227 complement(<1032740..>1033620)
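To see what the pattern captures, it can help to run it against a single header line. The sketch below is not part of the original answer: the `header` string is copied from test.txt in the question, and the group names `locus` and `location` are labels chosen here purely for illustration.

```python
import re

# One header line copied from test.txt in the question.
header = (">lcl|NW_001820817.1_gene_205 [locus_tag=SS1G_12233] "
          "[db_xref=GeneID:5483157] [partial=5',3'] "
          "[location=complement(<502136..>503461)] [gbkey=Gene]")

# The answer's pattern, with the literal brackets escaped as \[ and \]
# and the two capture groups given (hypothetical) names for readability.
pattern = re.compile(
    r"^>lcl.*\[locus_tag=(?P<locus>.*?)\].*\[location=(?P<location>.*?)\]"
)

match = pattern.search(header)
if match:
    # Prints: SS1G_12233 complement(<502136..>503461)
    print(match.group("locus"), match.group("location"))
```

The lazy `(.*?)` stops at the first closing bracket, which is why the parentheses in `complement(...)` are captured intact.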
Another variation, using a dict as the result and processing the file line by line:
import re
results = {}
with open('fichier1', 'r') as f_file:
    # Here we split the file's lines into a list
    data = f_file.readlines()
    for line in data:
        # Keep only the lines that begin with '>lcl';
        # the captures are the same as in the first attempt
        results.update(re.findall(r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]', line))
for locus, location in results.items():
    print(locus, location)
Edit: Creating a DataFrame and exporting it to a CSV file:
import re
from pandas import DataFrame as df
results = {}
with open('fichier1', 'r') as f_file:
    data = f_file.readlines()
    for line in data:
        results.update(re.findall(
            r'^>lcl.*\[locus_tag=(.*?)\].*\[location=(.*?)\]',
            line
        ))
df_ = df(
    list(results.items()),
    index=range(1, len(results) + 1),
    columns=['locus', 'location']
)
print(df_)
df_.to_csv('results.csv', sep=',')
This prints the DataFrame and creates a file called results.csv (the row order can differ from the file order on older Python versions, where dicts are unordered):
locus location
1 SS1G_12233 complement(<502136..>503461)
2 SS1G_08319 <504653..>506706
3 SS1G_05227 complement(<1032740..>1033620)
4 SS1G_02099 <2692251..>2693298
edited Nov 16 '18 at 3:21
answered Nov 16 '18 at 2:40 – Chiheb Nexus
Can't argue with those results.
– Mad Physicist
Nov 16 '18 at 2:45
3
@MAPK. Python has a really good regex tutorial as part of the official docs: docs.python.org/3/howto/regex.html. You will find that this answer is quite simple after reading through it.
– Mad Physicist
Nov 16 '18 at 2:46
@MadPhysicist Thanks for your comment. Can you explain to me why you don't argue with the results?
– Chiheb Nexus
Nov 16 '18 at 2:48
1
@MAPK See my last edit.
– Chiheb Nexus
Nov 16 '18 at 3:21
1
No shame whatsoever, I assure you. You're doing great as it is, and I'm happy to have introduced you to a new expression. I am guessing that French and Python are the first two languages :)
– Mad Physicist
Nov 16 '18 at 3:50
I would like to present two alternative solutions: one that extracts any set of named tags from your lines using regular expressions, and another that is a complete travesty but shows a way to do it without regular expressions.
Generic Regex Solution
import re
def get_tags(filename, tags, prefix='>lcl'):
    tags = set(tags)
    # Match one [key=value] pair; the brackets are escaped so they are literal
    pattern = re.compile(r'\[(.+?)=(.+?)\]')
    def parse_line(line):
        return {m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags}
    with open(filename) as f:
        return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]
This function returns a list of dictionaries keyed by the tags you are interested in. You would use it like this:
tags = ['locus_tag', 'location']
result = get_tags('test.txt', tags)
You could use the result to get the exact printout you want:
for line in get_tags('test.txt', tags):
    print(*(line[tag] for tag in tags))
This has the advantage that you can use the results however you choose later, and configure which tags you extract.
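As a small illustration of that configurability, here is a sketch that applies the same bracket pattern to lines held in memory rather than to a file. The helper name `parse_header` and the sample `lines` list are assumptions made for this example and are not part of the answer; the header line itself is copied from the question.

```python
import re

# Matches one [key=value] pair; the brackets are escaped so they are literal.
PAIR = re.compile(r"\[(.+?)=(.+?)\]")

def parse_header(line, tags):
    """Return {tag: value} for the requested tags found on one header line."""
    wanted = set(tags)
    return {m.group(1): m.group(2) for m in PAIR.finditer(line) if m.group(1) in wanted}

# A header line and one sequence line, as in test.txt.
lines = [
    ">lcl|NW_001820825.1_gene_208 [locus_tag=SS1G_08319] [db_xref=GeneID:5486863] "
    "[partial=5',3'] [location=<504653..>506706] [gbkey=Gene]",
    "ATGGGCAAAGCTTCTAGGAATAAGACGAAGCATCGCGCTGATCCTACCGCAAAAACTGTTAAGCCACCCA",
]

# Ask for a different set of tags without touching the parser.
records = [parse_header(l, ["locus_tag", "gbkey"]) for l in lines if l.startswith(">lcl")]
print(records)  # [{'locus_tag': 'SS1G_08319', 'gbkey': 'Gene'}]
```

Only the tag list changes between calls, so the same parser can serve several downstream uses.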
No Regex Solution
This version is just something I wrote to show that it is possible. Please do not emulate it, as the code is a pointless maintenance burden.
def get_tags2(filename, tags, prefix='>lcl'):
    tags = set(tags)
    def parse_line(line):
        # Split on '[' to isolate each tag, cut at ']', then split key from value
        items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
        return dict(tag for tag in items if tag[0] in tags)
    with open(filename) as f:
        return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]
This function behaves just like the first one, but the parsing function is a hot mess by comparison. It's also much less robust, e.g. because it assumes that all your square brackets more or less match.
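One concrete way the split-based approach can break, shown as a hedged sketch: a value that itself contains an `=` sign. This is just one fragility picked for illustration, not necessarily the case the answer had in mind. The header line, the `note` tag, and the locus `SS1G_00001` below are all made up, and `parse_split` and `parse_regex` are hypothetical stand-ins that follow the two parsing styles of `get_tags2` and `get_tags`.

```python
import re

def parse_split(line, tags):
    # Split-based parsing, following the approach of get_tags2.
    items = [part.split(']')[0].split('=') for part in line.split('[')[1:]]
    return dict(item for item in items if item[0] in tags)

def parse_regex(line, tags):
    # Regex-based parsing with escaped literal brackets.
    return {k: v for k, v in re.findall(r"\[(.+?)=(.+?)\]", line) if k in tags}

# Hypothetical header whose 'note' value itself contains an '=' sign.
line = ">lcl|example [locus_tag=SS1G_00001] [note=ratio=0.5]"
tags = {"locus_tag", "note"}

print(parse_regex(line, tags))

split_error = None
try:
    print(parse_split(line, tags))
except ValueError as exc:
    # 'note=ratio=0.5' splits into three parts, which dict() rejects.
    split_error = exc
    print("split-based parser failed:", exc)
```

The regex version keeps `ratio=0.5` as the value because only the first `=` separates key from value, while the split version produces a three-element item that `dict()` cannot consume.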
Here is an IDEOne link showing off both methods: https://ideone.com/X2LKqL
add a comment |
I would like to present two alternative solutions. One that will extract any set of named tags on your line using regular expressions, and another which is a complete travesty but shows a way to do it without regular expressions.
Generic Regex Solution
import re
def get_tags(filename, tags, prefix='>lcl'):
tags = set(tags)
pattern = re.compile(r'[(.+?)=(.+?)]')
def parse_line(line):
return m.group(1): m.group(2) for m in pattern.finditer(line) if m.group(1) in tags
with open(filename) as f:
return [parse_line(line) for line in f if prefix is None or line.startswith(prefix)]
This function returns a list of dictionaries keyed by the tags you are interested in you would use it like this:
tags = ['locus_tag', 'location']
result = get_tags('test.txt', tags)
You could use the result to get the exact printout you want:
for line in get_tags('test.txt', tags):
print(*(line[tag] for tag in tags))
This has the advantage that you can use the results as you chose later, and configure which tags you extract.
No Regex Solution
This version is just something I wrote to show that is possible. Please do not emulate it, as the code is a pointless maintenance burden.
def get_tags2(filename, tags, prefix='>lcl'):
tags = set(tags)
def parse_line(line):
items = [tag.split(']')[0].split('=') for tag in line.split('[')[1:]]
return dict(tag for tag in items if tag[0] in tags)
with open(filename) as f:
return [parse_line(line) for line in data if prefix is None or line.startswith(prefix)]
This function behaves just like the first one but the parsing function is a hot mess by comparison. It's also much less robust, e.g. because it's assumed that all your square brackets are more or less matching.
Here is an IDEOne link showing off both methods: https://ideone.com/X2LKqL
edited Nov 16 '18 at 4:06
answered Nov 16 '18 at 3:26


Mad Physicist
You forgot to ask a question. What have you tried?
– Lie Ryan
Nov 16 '18 at 2:33
1
@LieRyan Please see my edits.
– MAPK
Nov 16 '18 at 2:36