Why is Tesseract less accurate when run via NodeJS versus when run directly from the command line?
I am trying to extract text from local PNG images. I'm using Tesseract and running JS files from the command line via Node.
If I run Tesseract directly from the command line, it works perfectly; if I run it via JavaScript, it does not.
I have read a lot of posts with tips on improving OCR accuracy, but that is not the issue here. The text in the PNG file I am using CAN be extracted perfectly accurately, but the accuracy varies dramatically depending on which method I use to run Tesseract.
When I run Tesseract directly from the command line using the command tesseract.exe manual.png usingcommandline (manual.png is the file I am extracting from, and the output is saved to usingcommandline.txt), the returned file is very accurate.
However, when I run my JavaScript file from the command line using node ocr.js (which runs the code below) on the exact same file, it is far less accurate, to the point of being unusable. The code below creates a vianodejs.txt file containing result.text, the text extracted from manual.png.
var Tesseract = require('tesseract.js');
var fs = require('fs');
var filename = 'manual.png';
Tesseract.recognize(filename)
  .then(function (result) {
    // writeFileSync is synchronous and takes no callback; it throws on failure.
    try {
      fs.writeFileSync('vianodejs.txt', result.text);
      console.log('file written successfully');
    } catch (err) {
      console.log(err);
    }
    process.exit(0);
  });
It is the exact same manual.png file in both instances, but the output is far less accurate when I run Tesseract via the JavaScript program. Ultimately I need to run it through the JavaScript program for my application to work, and I need the level of accuracy that I get with the other method.
All ideas are welcome. Thanks.
Also, I have a follow-up question that will show my inexperience.
On Tesseract.recognize(filename): what is this called technically? Is it a method (recognize) running on an object (Tesseract)? Where is the function? Can I put the whole thing into a function that will run after another step is complete? (The other step is saving the photo in the first place using the webshot library.)
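On the follow-up: Tesseract.recognize(filename) is a method call, that is, the function recognize stored as a property of the object that require('tesseract.js') returns. One way to run it after another step is to wrap it in a named function and chain it onto that step. The following is only a sketch under assumptions: captureThenOcr, ocrToFile and captureStep are hypothetical names, and the capture step is assumed to return a promise that resolves once the image has been saved (a callback-style library such as webshot would need to be wrapped in a promise first). The recognize function is passed in as a parameter so the sketch does not depend on a particular tesseract.js version.

```javascript
const fs = require('fs');

// Wraps the OCR step in a named function so it can be called after another step.
// `recognize` is injected (for example Tesseract.recognize) and is assumed to
// return a promise-like value that resolves to an object with a `text` property.
function ocrToFile(recognize, imagePath, outputPath) {
  return Promise.resolve(recognize(imagePath)).then(function (result) {
    // writeFileSync is synchronous: it throws on failure and takes no callback.
    fs.writeFileSync(outputPath, result.text);
    return result.text;
  });
}

// Runs `captureStep` first (hypothetical: a function that returns a promise
// resolving once the image is saved) and only then starts OCR.
function captureThenOcr(captureStep, recognize, imagePath, outputPath) {
  return captureStep(imagePath).then(function () {
    return ocrToFile(recognize, imagePath, outputPath);
  });
}
```

With the code from the question, the call might look like captureThenOcr(saveScreenshot, Tesseract.recognize, 'manual.png', 'vianodejs.txt'), where saveScreenshot is whatever function saves the image.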
javascript node.js tesseract
Assuming that the text you're trying to OCR is English, have you tried uploading your picture to tesseract.projectnaptha.com to see the result?
– robertklep
Nov 10 at 15:49
Thanks. I have, and it is returning the exact same results as when I run the JS file (the one that is not accurate). This may be a clue in itself. When I downloaded Tesseract, I had difficulty loading the language. When it tried to put the eng.traineddata file into the folder where my project was, it transferred an empty file that was 0 bytes. Someone else had the same problem, and their solution was to download the gzip file again from Tesseract and unzip it into the project folder. I wonder whether that could be related to the problem? That would suggest the naptha link is not using eng?
– Frankie
Nov 10 at 15:53
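An empty language file like the one described above can be detected before running OCR. This is a minimal sketch, not part of tesseract.js: isNonEmptyFile is a hypothetical helper, and the file name eng.traineddata in the project folder is taken from the comment above (the library may expect a different name or a gzipped file, depending on the version).

```javascript
const fs = require('fs');

// Returns true only if the path exists, is a regular file, and is not empty.
function isNonEmptyFile(filePath) {
  try {
    const stats = fs.statSync(filePath);
    return stats.isFile() && stats.size > 0;
  } catch (err) {
    // statSync throws (for example with ENOENT) when the file is missing.
    return false;
  }
}

// Assumed location: the language file sits next to the script.
if (!isNonEmptyFile('eng.traineddata')) {
  console.log('eng.traineddata is missing or empty');
}
```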
Since tesseract.js is a JS "translation", perhaps it's based on an older version of Tesseract (not sure if you can query the library for that, to compare it to the CLI version you have). Eventually, if you can't get it working properly, you could consider using a Tesseract package that uses child_process to call the CLI version.
– robertklep
Nov 10 at 15:58
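The child_process route mentioned above can also be tried without an extra package. A rough sketch, assuming the tesseract binary is on the PATH and accepts the same arguments as the command in the question; buildTesseractArgs and runTesseractCli are hypothetical helper names, and -l is Tesseract's language option.

```javascript
const { execFile } = require('child_process');

// Builds the argument list for: tesseract <image> <outputBase> [-l <lang>]
// Tesseract itself appends ".txt" to outputBase.
function buildTesseractArgs(imagePath, outputBase, lang) {
  const args = [imagePath, outputBase];
  if (lang) {
    args.push('-l', lang);
  }
  return args;
}

// Runs the CLI and resolves when it exits successfully, rejects otherwise.
// `binary` defaults to 'tesseract' and is assumed to be on the PATH.
function runTesseractCli(imagePath, outputBase, lang, binary = 'tesseract') {
  return new Promise(function (resolve, reject) {
    const args = buildTesseractArgs(imagePath, outputBase, lang);
    execFile(binary, args, function (err, stdout, stderr) {
      if (err) {
        reject(err);
      } else {
        resolve({ stdout: stdout, stderr: stderr });
      }
    });
  });
}
```

Calling runTesseractCli('manual.png', 'usingcommandline', 'eng') would then correspond to the command line from the question, with the result written to usingcommandline.txt by Tesseract itself.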
If the trained data isn't installed correctly, I would assume that that would severely hinder the performance. From this, I gather that the files should remain gzipped.
– robertklep
Nov 10 at 16:00
Thanks again. I'm not sure what some of that means, but seeing as I downloaded it yesterday, I doubt it is old? It is very strange that the naptha site is returning bad results for English.
– Frankie
Nov 10 at 16:00
asked Nov 10 at 15:44
Frankie
61