Why is Tesseract less accurate when run via NodeJS versus when run directly from the command line?

I am trying to extract text from local PNG images. I'm using Tesseract and running JS files from the command line via node.

If I run Tesseract from the command line, it works perfectly; if I run it via JavaScript, it does not.

I have read a lot of posts with tips on improving OCR accuracy, but that is not the issue here. The text in the PNG file I am using CAN be extracted perfectly accurately; the problem is that the accuracy varies dramatically depending on which method I use to run Tesseract.

When I run Tesseract directly from the command line using the command tesseract.exe manual.png usingcommandline (manual.png is the file I am extracting from, and the output is saved to usingcommandline.txt), the returned file is very accurate.

However, when I run my JavaScript file from the command line using node ocr.js (which runs the code below) on the exact same file, it is far less accurate. It is unusable. (The code below creates a vianodejs.txt file containing result.text, the text extracted from manual.png.)

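var Tesseract = require('tesseract.js');
var fs = require('fs');
var filename = 'manual.png';

Tesseract.recognize(filename)
  .then(function (result) {
    // writeFile is asynchronous, so the callback below is actually invoked
    // (writeFileSync does not take a callback)
    fs.writeFile('vianodejs.txt', result.text, function (err) {
      if (err) {
        console.log(err);
      } else {
        console.log("file written successfully");
      }
      // exit only once the write has finished
      process.exit(0);
    });
  });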

It is the exact same manual.png file in both instances, but the result is far less accurate when I run Tesseract via the JavaScript program. Ultimately I need to run it through the JavaScript program for my application to work, and I need the level of accuracy that I get from the command line.

All ideas are welcome. Thanks.

Also, I have a follow up question that will show my inexperience.

On Tesseract.recognize(filename): what is this called technically? Is it a method (recognize) running on an object (Tesseract)? Where is the function? Can I put the whole thing into a function that will run after another step is complete? (The other step is saving the photo in the first place using the webshot library.)
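On the follow-up: in the code above, recognize is called as a method on the object exported by the module, and whatever it returns has a .then method, which is why a callback can be chained onto it. Below is a minimal sketch of wrapping that in a function that can be called after another step. The name ocrToFile is hypothetical, and the recognizer is passed in as a parameter (for example `function (p) { return Tesseract.recognize(p); }`) so that the sketch does not assume anything about tesseract.js beyond what the question's code already relies on, namely a result with a text property.

```javascript
const fs = require('fs');

// Hypothetical helper: runs OCR on imagePath and writes the text to outPath.
// `recognize` is any function that takes a path and returns a promise-like
// object resolving to an object with a `text` property.
function ocrToFile(recognize, imagePath, outPath) {
  // Promise.resolve turns any object with a .then method into a real Promise.
  return Promise.resolve(recognize(imagePath)).then(function (result) {
    fs.writeFileSync(outPath, result.text);
    return result.text;
  });
}
```

Because ocrToFile returns a promise, it can be called from wherever the earlier step reports completion. Assuming webshot is used in its webshot(url, file, callback) form, the call would go inside that callback, after checking that no error was passed.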

asked Nov 10 at 15:44

Frankie

Assuming that the text you're trying to OCR is English, have you tried uploading your picture to tesseract.projectnaptha.com to see the result?
– robertklep
Nov 10 at 15:49

Thanks. I have, and it returns exactly the same results as when I run the JS file (the one that is not accurate). This may be a clue in itself. When I downloaded Tesseract, I had difficulty loading the language. When it tried to put the eng.traineddata file into my project folder, it transferred an empty file of 0 bytes. Someone else had the same problem, and their solution was to download the gzip file again from Tesseract and unzip it into the project folder. I wonder, could that be related to the problem? That would suggest the naptha link is not using eng?
– Frankie
Nov 10 at 15:53

Since tesseract.js is a JS "translation", perhaps it's based on an older version of Tesseract (not sure if you can query the library for that, to compare it to the CLI version you have). Eventually, if you can't get it working properly, you could consider using a Tesseract package that uses child_process to call the CLI version.
– robertklep
Nov 10 at 15:58
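A minimal sketch of the child_process route mentioned above, assuming the command-line Tesseract is installed and reachable on PATH as tesseract (on Windows that may need to be tesseract.exe or a full path). The function names are hypothetical. The arguments mirror the command from the question, with the language passed explicitly via -l.

```javascript
const { execFile } = require('child_process');

// Builds the argument list for: tesseract <image> <outputbase> -l <lang>
function buildTesseractArgs(imagePath, outputBase, lang) {
  return [imagePath, outputBase, '-l', lang || 'eng'];
}

// Runs the command-line Tesseract, which writes <outputBase>.txt itself.
// `binary` is an assumption: it defaults to 'tesseract' found on PATH.
function runTesseractCli(imagePath, outputBase, lang, binary) {
  return new Promise(function (resolve, reject) {
    execFile(
      binary || 'tesseract',
      buildTesseractArgs(imagePath, outputBase, lang),
      function (err) {
        if (err) {
          reject(err);
        } else {
          resolve(outputBase + '.txt'); // path of the file the CLI produced
        }
      }
    );
  });
}
```

Calling runTesseractCli('manual.png', 'usingcommandline') should then produce the same usingcommandline.txt as the manual command, since it is the same binary doing the work.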

If the trained data isn't installed correctly, I would assume that that would severely hinder the performance. From this, I gather that the files should remain gzipped.
– robertklep
Nov 10 at 16:00
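Since a 0-byte eng.traineddata was mentioned, a quick sanity check before running OCR may help rule that out. This is only a sketch: the function name is hypothetical, and it checks nothing beyond the file existing and being non-empty, so it cannot tell whether the contents are valid or whether the library expects the gzipped or unzipped form.

```javascript
const fs = require('fs');

// Returns true only if the file exists and is not empty.
function trainedDataLooksUsable(filePath) {
  try {
    return fs.statSync(filePath).size > 0;
  } catch (err) {
    return false; // missing or unreadable file
  }
}
```

For example, trainedDataLooksUsable('eng.traineddata') returning false in the project folder would point to the failed download described above.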

Thanks again. I'm not sure what some of that means, but seeing as I downloaded it yesterday, I doubt it is old? It is very strange that the naptha site is returning bad results for English.
– Frankie
Nov 10 at 16:00

javascript node.js tesseract

