Why is Tesseract less accurate when run via NodeJS versus when run directly from the command line?









I am trying to extract text from local PNG images. I'm using Tesseract and running JS files from the command line via node.



If I run Tesseract from the command line, it works perfectly. If I run it via JavaScript, it does not.



I have read a lot of posts with tips on improving OCR accuracy, but that is not the issue here. The text in the PNG file I am using CAN be extracted perfectly accurately; the accuracy just varies dramatically depending on which method I use to run Tesseract.



When I run Tesseract directly from the command line using the command tesseract.exe manual.png usingcommandline (manual.png is the file I am extracting from, and the output is saved to usingcommandline.txt), the resulting file is very accurate.



However, when I run my JavaScript file from the command line using node ocr.js (which runs the code below) on the exact same file, the output is far less accurate, to the point of being unusable. (The code below creates a vianodejs.txt file containing result.text, the text extracted from manual.png.)



var Tesseract = require('tesseract.js');
var filename = 'manual.png';
var fs = require('fs');

Tesseract.recognize(filename)
  .then(function (result) {
    // writeFileSync takes no callback; it throws on failure, so use try/catch
    try {
      fs.writeFileSync('vianodejs.txt', result.text);
      console.log("file written successfully");
    } catch (err) {
      console.log(err);
    }
    process.exit(0);
  });


It is the exact same manual.png file in both instances, but the result is far less accurate when I run Tesseract via the JavaScript program. Ultimately I need to run it through the JavaScript program for my application to work, and I need the level of accuracy that I get with the other method.



All ideas are welcome. Thanks.



Also, I have a follow-up question that will show my inexperience.



On Tesseract.recognize(filename): what is this called technically? Is it a method (recognize) running on an object (Tesseract)? Where is the function? Can I put the whole thing into a function that will run after another step is complete? (The other step is saving the photo in the first place using the webshot library.)
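
To make the follow-up concrete: recognize is a function stored as a property of the Tesseract object, so Tesseract.recognize(filename) is normally described as a method call, and the code above shows that its return value supports .then. A minimal sketch of running it only after an earlier step might look like the following. The names ocrAfterSave, saveScreenshot and recognize are hypothetical placeholders supplied by the caller (for example a promise-returning wrapper around webshot, and Tesseract.recognize itself); none of them come from either library.

```javascript
// Sketch only: `saveScreenshot` and `recognize` are placeholders passed in by
// the caller; this is not an API of tesseract.js or webshot.
function ocrAfterSave(saveScreenshot, recognize, filename) {
  // Promise.resolve lets the save step return either a plain value or a promise.
  return Promise.resolve(saveScreenshot(filename))
    .then(function () {
      // Runs only after the save step has finished.
      return recognize(filename);
    })
    .then(function (result) {
      // Hand back just the recognized text.
      return result.text;
    });
}

// Usage (not executed here; saveWithWebshot is a hypothetical wrapper):
// ocrAfterSave(saveWithWebshot, Tesseract.recognize, 'manual.png')
//   .then(function (text) { console.log(text); });
```

Passing the two steps in as arguments keeps the ordering logic separate from the libraries, so either step can be swapped out later.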










  • Assuming that the text you're trying to OCR is English, have you tried uploading your picture to tesseract.projectnaptha.com to see the result?
    – robertklep
    Nov 10 at 15:49










  • Thanks. I have, and it returns the exact same results as when I run the JS file (the inaccurate one). This may be a clue in itself. When I downloaded Tesseract, I had difficulty loading the language. When it tried to put the eng.traineddata file into my project folder, it transferred an empty, 0-byte file. Someone else had the same problem and their solution was to download the gzip file again from Tesseract and unzip it into the project folder. Could that be related to the problem? That would suggest the naptha link is not using eng?
    – Frankie
    Nov 10 at 15:53
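
One quick way to rule out the 0-byte file described above is to check the size of the language file from Node before running OCR. This is only a sketch: the helper name isNonEmptyFile and the path in the usage comment are assumptions, and a non-empty file is not necessarily a valid one.

```javascript
const fs = require('fs');

// Returns true when the path is a regular file containing at least one byte.
function isNonEmptyFile(filePath) {
  try {
    const stats = fs.statSync(filePath);
    return stats.isFile() && stats.size > 0;
  } catch (err) {
    // Missing or unreadable file: treat it as unusable.
    return false;
  }
}

// Usage (path is an assumption; adjust to where the language file really is):
// console.log(isNonEmptyFile('eng.traineddata'));
```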











  • Since tesseract.js is a JS "translation", perhaps it's based on an older version of Tesseract (not sure if you can query the library for that, to compare it to the CLI version you have). Eventually, if you can't get it working properly, you could consider using a Tesseract package that uses child_process to call the CLI version.
    – robertklep
    Nov 10 at 15:58
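
As a rough illustration of the child_process suggestion, the sketch below calls the native CLI with Node's built-in execFile, using the same arguments as the command in the question, and then reads back the .txt file the CLI writes. The helper names buildTesseractArgs and runTesseractCli are hypothetical, and the sketch assumes the tesseract executable can be found on PATH.

```javascript
const { execFile } = require('child_process');
const fs = require('fs');

// Same arguments as `tesseract.exe manual.png usingcommandline`.
function buildTesseractArgs(imagePath, outputBase) {
  return [imagePath, outputBase];
}

// Hypothetical helper: resolves with the recognized text.
// `executable` defaults to 'tesseract' and is assumed to be on PATH.
function runTesseractCli(imagePath, outputBase, executable = 'tesseract') {
  return new Promise((resolve, reject) => {
    execFile(executable, buildTesseractArgs(imagePath, outputBase), (err) => {
      if (err) {
        reject(err);
        return;
      }
      // The CLI writes its result to <outputBase>.txt.
      fs.readFile(outputBase + '.txt', 'utf8', (readErr, text) => {
        if (readErr) reject(readErr);
        else resolve(text);
      });
    });
  });
}

// Usage (not executed here):
// runTesseractCli('manual.png', 'usingcommandline')
//   .then((text) => console.log(text))
//   .catch((err) => console.error(err));
```

Because this runs the same binary as the command line, the output should match the command-line result, at the cost of requiring Tesseract to be installed on the machine.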










  • If the trained data isn't installed correctly, I would assume that that would severely hinder the performance. From this, I gather that the files should remain gzipped.
    – robertklep
    Nov 10 at 16:00










  • Thanks again, I'm not sure what some of that means, but seeing as I downloaded it yesterday I doubt it is old? It is very strange that the naptha site is returning bad results for English.
    – Frankie
    Nov 10 at 16:00














javascript node.js tesseract






asked Nov 10 at 15:44 by Frankie











