Getting 502 Bad Gateway and then redirected to another website. Scrapy
Scrapy mega noob here.
When I run scrapy shell against a website, for example:
scrapy shell https://shop.coles.com.au/a/a-vic-metro-oakleigh/product/gasmate-cartridge-butane
I get the following messages:
...
[scrapy.core.engine] INFO: Spider opened
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://shop.coles.com.au/a/a-vic-metro-oakleigh/product/gasmate-cartridge-butane> (failed 1 times): 502 Bad Gateway
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://shop.coles.com.au/a/a-vic-metro-oakleigh/product/gasmate-cartridge-butane> (failed 2 times): 502 Bad Gateway
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://shop.coles.com.au/a/a-vic-metro-oakleigh/product/gasmate-cartridge-butane> (referer: None)
...
Then, when I inspect response.body:
In [1]: print(response.body)
b'<html><body><script>var $j='c';$3='c';$q='c';$G='f';$s='c';$F='c';$X='c';$t='c';$H='c';$e='=';$D='c';$g='c';$8='c';$A='c';$6='c';$O='=';$P='c';$U='5';$4='6';$y='c';$v='c';$u='c';$b='c';$V='b';$r='5';$2='6';$Q='f';$R='c';$5='c';$9='c';$c='c';$S='c';$l='c';$k='c';$m='_';$M='5';$N='c';$C='c';$d='c';$J='b';$E='5';$1='6';$i='f';document.cookie=(!4?$j:"")+(!""?$3:"")+(!4?$q:"")+(!4?$G:"")+(!()?$s:"")+(!NaN?$F:"")+(!NaN?$X:"")+(!?$t:"")+(!0?$H:"")+(!?$e:"")+(!4?$D:"")+(!""?$g:"")+(!""?$8:"")+(!?$A:"")+(!NaN?$6:"")+(!NaN?$O:"")+(!?$P:"")+(!0?$U:"")+(!()?$4:"")+(!4?$y:"")+(!""?$v:"")+(!0?$u:"")+(!0?$b:"")+(!""?$V:"")+(!0?$r:"")+(!0?$2:"")+(!""?$Q:"")+(!0?$R:"")+(!NaN?$5:"")+(!""?$9:"")+(!NaN?$c:"")+(!""?$S:"")+(!""?$l:"")+(!NaN?$k:"")+(!0?$m:"")+(!0?$M:"")+(!""?$N:"")+(!NaN?$C:"")+(!NaN?$d:"")+(!NaN?$J:"")+(!""?$E:"")+(!""?$1:"")+(!0?$i:"")+'; path=/';window.location.href=window.location.href;</script></body></html'
This is not the website's HTML. In a browser, the HTML of https://shop.coles.com.au/a/a-vic-metro-oakleigh/product/gasmate-cartridge-butane is completely different, so I know I'm being redirected somewhere.
How and why is this happening, and most importantly, how do I avoid it?
Additional info: I'm using a proxy service that picks a random proxy from a pool of over 20,000 each time I run scrapy shell.
It's also worth noting that I had been scraping this site for quite a long time before this issue started.
shell redirect scrapy
asked Nov 15 '18 at 23:33
Jackknife
1 Answer
If you look at the JavaScript in the response body, it sets a cookie and then reloads the same URL (window.location.href=window.location.href).
It seems that the website expects you to have a specific cookie to access the "normal" pages, but since Scrapy can't execute JavaScript, it stops there.
You may want to parse the JavaScript code somehow, set the cookie manually, and re-request the same URL.
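As a rough illustration of the "parse the JavaScript and set the cookie manually" idea, here is a minimal sketch using only the standard library. It assumes the challenge keeps the shape shown in the question: one-character variables such as $j='c', and terms of the form (!<literal>?$var:"") concatenated into document.cookie. The names cookie_from_challenge, split_cookie, and SAMPLE_BODY are hypothetical, and only simple literals (0, "", NaN and the like) are evaluated; any other condition is treated as truthy, which may well be wrong for the real script, in which case a JavaScript engine would be needed.

```python
import re
from typing import Tuple

# JavaScript literals that are falsy, so "!literal" evaluates to true.
# Assumption: the challenge only uses simple literals like these.
JS_FALSY_LITERALS = {"0", '""', "''", "NaN", "null", "undefined", "false"}

# Hypothetical, shortened challenge body in the same shape as the one in the question.
SAMPLE_BODY = (
    "<html><body><script>var $a='k';$b='=';$c='v';$d='x';"
    'document.cookie=(!0?$a:"")+(!""?$b:"")+(!4?$d:"")+(!NaN?$c:"")'
    "+'; path=/';window.location.href=window.location.href;"
    "</script></body></html>"
)


def cookie_from_challenge(body: str) -> str:
    """Rebuild the 'name=value' text that the script assigns to document.cookie."""
    # Collect the one-character variables: $a='k';$b='='; ...
    values = dict(re.findall(r"(\$\w+)='(.)'", body))
    # Each term looks like (!<literal>?$var:"").
    terms = re.findall(r'\(!([^?]*)\?(\$\w+):""\)', body)
    parts = []
    for literal, var in terms:
        # "!literal" is true only when the literal is falsy; unknown
        # conditions are assumed truthy and therefore contribute nothing.
        if literal.strip() in JS_FALSY_LITERALS:
            parts.append(values.get(var, ""))
    return "".join(parts)


def split_cookie(pair: str) -> Tuple[str, str]:
    """Split 'name=value' at the first '=' into (name, value)."""
    name, _, value = pair.partition("=")
    return name, value


if __name__ == "__main__":
    name, value = split_cookie(cookie_from_challenge(SAMPLE_BODY))
    print(name, value)
```

With the name and value in hand, the same URL can be requested again from a spider callback with something like scrapy.Request(response.url, cookies={name: value}, dont_filter=True); dont_filter is needed because the duplicate filter would otherwise drop a second request to the same URL. Since the question mentions a rotating proxy pool, the cookie may only be accepted if the retry goes out through the same proxy, which is worth checking.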
answered Nov 16 '18 at 18:25
Guillaume
That makes sense. Thank you
– Jackknife
Nov 18 '18 at 22:27