Re: not work for a page in weather.com

Re: not work for a page in weather.com

Xue-Feng Yang-2
I ran more experiments on this issue. I added the following:

webClient.getOptions().setUseInsecureSSL(true);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());

JavaScriptJobManager manager = htmlPage.getEnclosingWindow().getJobManager();
int count = 0;
// Poll until no background JavaScript jobs remain, printing the count on each pass.
while (manager.getJobCount() > 0) {
    System.out.println(count + "@" + manager.getJobCount());
    webClient.waitForBackgroundJavaScript(10000);
    count++;
}

Then I went to sleep. It has now been running for a few hours. The job count dropped from 20 to 3 and has stayed at 3.

Any thoughts?

Thanks
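
A minimal sketch of a bounded version of the polling loop above, on the assumption that the jobs left over are recurring timers that never finish on their own. The class name BoundedJobWait and all parameter names are hypothetical and not part of HtmlUnit; the job count and the wait call are passed in as functions so the logic can be tried without a browser.

```java
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

/** Bounded replacement for an open-ended "while (jobCount > 0)" polling loop. */
final class BoundedJobWait {

    /**
     * Polls until no jobs remain, the count stays unchanged for
     * maxStableRounds consecutive polls, or maxRounds polls have run.
     *
     * @return the last job count observed
     */
    static int waitForJobs(IntSupplier jobCount, LongConsumer waitMillis,
                           long stepMillis, int maxRounds, int maxStableRounds) {
        int previous = jobCount.getAsInt();
        int stableRounds = 0;
        for (int round = 0; round < maxRounds && previous > 0; round++) {
            waitMillis.accept(stepMillis);   // give background jobs time to run
            int current = jobCount.getAsInt();
            stableRounds = (current == previous) ? stableRounds + 1 : 0;
            previous = current;
            if (stableRounds >= maxStableRounds) {
                break;                       // plateau: the count has stopped changing
            }
        }
        return previous;
    }
}
```

With the variables from the loop above it could be wired up as BoundedJobWait.waitForJobs(manager::getJobCount, webClient::waitForBackgroundJavaScript, 10000, 30, 3). The step, round, and plateau values are guesses to tune per site. A count that plateaus at 3 would then end the wait after three unchanged polls instead of running for hours.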

On Wed, Jul 12, 2017 at 10:56 PM, Xue-Feng Yang <[hidden email]> wrote:

Hi, I have been using HtmlUnit to fetch some other web pages, and it works great.

However, when I tried https://weather.com/weather/monthly/l/27560:4:US , the result was not correct.

Here is a summary of my system:

OS: win 10
Java: jdk1.8.0_131
htmlunit: htmlunit-2.27-bin

Attached are three pictures.

eclipse-debug gives the result htmlunit got. The main code is as follows:

        webClient = new WebClient(BrowserVersion.FIREFOX_45);
        webClient.getOptions().setTimeout(600 * 1000);
        webClient.waitForBackgroundJavaScript(600 * 1000);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);

        htmlPage = webClient.getPage(_url);
        page = htmlPage.asXml();

view-source is the page source as shown by Firefox.

inspector is the DOM tree from the Firefox debugger.

The pictures show that only the Firefox debugger has the right HTML tree.

My question is: how can I get that HTML tree using HtmlUnit?

Thanks,

Xuefeng


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Htmlunit-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/htmlunit-user
Re: not work for a page in weather.com

albu77

You are really testing my memory, man...

My idea is that the page sets some timers (auto refresh, periodic updates, and so on). As explained here: http://www.webdeveloper.com/forum/showthread.php?233448-Is-there-a-way-to-find-if-any-intervals-are-still-open

"You cannot reliably tell if there are any unnamed intervals running, but you can shut down any that are open."

In my previous answer you can see a call to a method named attendPourJavascriptSaufTimers ("wait for JavaScript, except timers"), for example in:

// Add a fake submit button so the form can be submitted (comments translated from French)
loginForm.appendChild(fauxBouton);
pageEnCours = fauxBouton.click();
// Original call, which gave me trouble:
// webClient.waitForBackgroundJavaScript(AttentePourJavascript.CINQ_SECONDES.getTempo());
// Replacement: waits up to 5 seconds but can return earlier if nothing is running
webClient.attendPourJavascriptSaufTimers(pageEnCours, AttentePourJavascript.CINQ_SECONDES.getTempo());
print.save(NomsFichiersPagesSauvegardees.APRES_LOGGING.getUrl(), pageEnCours.asXml(), original);

Here is what this method does:

public int attendPourJavascriptSaufTimers(HtmlPage page, long tempo) {
    // The scripts are described in an enumeration
    String texteDuScript = ScriptAExecuter.ANNULE_LES_TIMERS.getScript();
    // Run the timer-cancelling script in the page; its result is not needed
    page.executeJavaScript(texteDuScript);
    // Then wait for whatever background JavaScript is left
    return this.waitForBackgroundJavaScript(tempo);
}

The script executed (ANNULE_LES_TIMERS) is the following:

var limit = 10;
// Create a throwaway interval just to learn the most recent timer id
var np, n = setInterval(function () {}, 100000);
np = Math.max(0, n - limit);
// Clear the most recent timer ids, starting with the throwaway one
while (n > np) {
    clearInterval(n--);
}

I wrote all this because I was running into the same kind of problem you are, not getting all the page content I expected. So my advice is to follow my track a little, even if I no longer remember all the details.
I also think you can check whether the website you are scraping sets any intervals by using the DevTools console of your browser.
I remember going back and forth between DevTools and HtmlUnit like this. You really have to understand completely what is running on the site if you want to mimic it.
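
To make the arithmetic of the ANNULE_LES_TIMERS script concrete, here is a small sketch that mirrors its loop in plain Java. It assumes, as the script itself does, that timer ids are handed out as increasing integers. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Mirrors the id arithmetic of the timer-cancelling script, in plain Java. */
final class TimerIdWindow {

    /**
     * Returns the ids the script would pass to clearInterval, given the id
     * returned by its own probe setInterval call and the window size.
     */
    static List<Integer> idsToClear(int newestId, int limit) {
        int lowerBound = Math.max(0, newestId - limit);   // "np" in the script
        List<Integer> ids = new ArrayList<>();
        for (int id = newestId; id > lowerBound; id--) {  // "while (n > np) clearInterval(n--)"
            ids.add(id);
        }
        return ids;
    }
}
```

For example, idsToClear(12, 10) yields ids 12 down to 3, and idsToClear(4, 10) yields 4 down to 1. Under this reading, timers with an id more than limit below the newest one are left alone, so a page that has created more than 10 timers may need a larger limit.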

Re: not work for a page in weather.com

Xue-Feng Yang-2
Thanks. That solution is a little complicated for me, since I need to load a few hundred remote pages. I'll try it later if my current method doesn't work.

Re: not work for a page in weather.com

albu77

I don't understand what you mean by "load a few hundred remote pages". HtmlUnit is meant for interacting with pages; it is a browser without a GUI. Are you interacting with hundreds of pages?



Re: not work for a page in weather.com

Xue-Feng Yang-2
Yes, I use it to download data with different parameters.

Thanks.

Re: not work for a page in weather.com

Xue-Feng Yang-2
In reply to this post by albu77
I have made more progress. In my tests, the Chrome and Firefox 52 BrowserVersion settings (or the best-supported one) bring the job count down to 2 within 5 minutes, and the data I need is then available in webClient. The others (Firefox 45, Edge, IE) do not get the job count down to 2 within 5 minutes, and I cannot see the data in webClient while the job count is still 3 or more.

It is worth pointing out that all the real browsers I tested show the data in less than one minute.

Any more suggestions?
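
Since the data shows up while some jobs are still pending, one option is to wait for the data itself rather than for a particular job count. This is only a sketch of that idea. ContentWait and its parameters are hypothetical names, and the page markup and the wait call are passed in as functions so that nothing here depends on a specific HtmlUnit version.

```java
import java.util.function.LongConsumer;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Waits for expected content instead of for the job queue to drain. */
final class ContentWait {

    /**
     * Polls the rendered markup until ready accepts it or maxRounds polls have run.
     *
     * @return true if the content became ready in time
     */
    static boolean waitForContent(Supplier<String> markup, Predicate<String> ready,
                                  LongConsumer waitMillis, long stepMillis, int maxRounds) {
        for (int round = 0; round < maxRounds; round++) {
            if (ready.test(markup.get())) {
                return true;                 // data is present, stop waiting
            }
            waitMillis.accept(stepMillis);   // give background jobs time to run
        }
        return ready.test(markup.get());     // final check after the last wait
    }
}
```

With the objects from the earlier posts this might be called as ContentWait.waitForContent(htmlPage::asXml, xml -> xml.contains(marker), webClient::waitForBackgroundJavaScript, 2000, 150), where marker is a placeholder for some string that only appears once the monthly data has been rendered.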

Re: not work for a page in weather.com

albu77

Which data are you extracting, exactly? If you can tell me, it could help me see whether there is a workaround.



On 14/07/2017 at 04:36, Xue-Feng Yang wrote:
I have made more progress. In my tests, the Chrome and FireFox52 (or best-supported) BrowserVersion settings reduce the job count to 2 within 5 minutes, and the data I need is then available in webClient. The others (FireFox45, Edge, IE) cannot reduce the job count to 2 within 5 minutes, and I can't see the data in webClient while the job count is still 3 or more.

It's worth pointing out that all the real browsers I tested show the data in less than one minute.
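
Since the job count may never reach 0, a wait loop probably needs both a threshold and a deadline. Below is a minimal sketch of that idea using only the JDK. The class name BoundedJobWait, the method waitUntilAtMost and the fake counter in main are made up for illustration; with HtmlUnit, the supplier would presumably be manager::getJobCount.

```java
import java.util.function.IntSupplier;

/** Polls a job counter until it drops to a threshold or a deadline passes. */
final class BoundedJobWait {

    /**
     * @param jobCount  supplies the current number of pending jobs
     * @param threshold job count considered good enough (2 in the report above)
     * @param timeoutMs total time budget in milliseconds
     * @param pollMs    pause between two checks in milliseconds
     * @return true if the count reached the threshold before the deadline
     */
    static boolean waitUntilAtMost(IntSupplier jobCount, int threshold,
                                   long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (jobCount.getAsInt() > threshold) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of looping forever
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the real job manager: reports 20, 19, 18, ... on each
        // check but never less than 2, mimicking timers that never finish.
        int[] remaining = {20};
        IntSupplier fakeJobs = () -> Math.max(2, remaining[0]--);
        System.out.println(waitUntilAtMost(fakeJobs, 2, 5_000, 1));
    }
}
```

In a real run, the Thread.sleep call could be swapped for webClient.waitForBackgroundJavaScript(pollMs), so that the pause ends early when nothing is left to run.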

Any more suggestions?

On Thu, Jul 13, 2017 at 12:48 PM, Albu Gmail <[hidden email]> wrote:

You are really testing my memory, man...

My idea is that the page sets some timers (auto-refresh, periodic updates, and so on), and, as explained here: http://www.webdeveloper.com/forum/showthread.php?233448-Is-there-a-way-to-find-if-any-intervals-are-still-open

You cannot reliably tell if there are any unnamed intervals running, but you can shut down any that are open.

In my previous answer you can see a call to a method named attendPourJavascriptSaufTimers (roughly, "wait for JavaScript except timers"), for example in:

    // Add a fake submit button so the form can be submitted (translated from French)
    loginForm.appendChild(fauxBouton);
    pageEnCours = fauxBouton.click();
    // Original call, which gave me trouble:
    // webClient.waitForBackgroundJavaScript(AttentePourJavascript.CINQ_SECONDES.getTempo());
    // Waits up to 5 seconds, but may return earlier if nothing is running
    webClient.attendPourJavascriptSaufTimers(pageEnCours, AttentePourJavascript.CINQ_SECONDES.getTempo());
    print.save(NomsFichiersPagesSauvegardees.APRES_LOGGING.getUrl(), pageEnCours.asXml(), original);

What this method does:

    public int attendPourJavascriptSaufTimers(HtmlPage page, long tempo) {
        // The scripts are described in an enumeration
        String texteDuScript = ScriptAExecuter.ANNULE_LES_TIMERS.getScript();
        // Run the timer-cancelling script in the page first
        page.executeJavaScript(texteDuScript);
        // Then wait for the remaining background JavaScript;
        // the return value is the number of jobs still pending
        return this.waitForBackgroundJavaScript(tempo);
    }

The script executed (ANNULE_LES_TIMERS) is the following:
    var limit = 10;
    var n = setInterval(function () {}, 100000);
    var np = Math.max(0, n - limit);
    while (n > np) {
        clearInterval(n--);
    }
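
To make it clearer which timers this script touches, here is a small Java sketch of the same arithmetic. It assumes that interval IDs are handed out as increasing integers, which is typical but not guaranteed everywhere. The class TimerIdRange and the method clearedIds are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Mirrors the ID arithmetic of the timer-cancelling script. */
final class TimerIdRange {

    /**
     * Returns the IDs that the loop "while (n > np) clearInterval(n--)"
     * would pass to clearInterval, newest first.
     *
     * @param newestId ID returned by the dummy setInterval call (n in the script)
     * @param limit    how far back to go (limit in the script)
     */
    static List<Integer> clearedIds(int newestId, int limit) {
        int lowest = Math.max(0, newestId - limit); // np in the script
        List<Integer> ids = new ArrayList<>();
        for (int n = newestId; n > lowest; n--) {
            ids.add(n);
        }
        return ids;
    }

    public static void main(String[] args) {
        // With 14 as the newest ID and a limit of 10, IDs 14 down to 5 are cleared.
        System.out.println(clearedIds(14, 10));
    }
}
```

Under that assumption, only the newest `limit` IDs are cleared, so a page that registered more timers than that before the script ran would keep its oldest ones; raising the limit might help in that case.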

I wrote all this because I was running into the same kind of problem as you, not getting all the page content I should. So my advice is to follow my track a little, even if I don't remember all the details.
You can also check whether the website you are scraping sets any intervals by using the DevTools console of your browser.
I remember going back and forth between DevTools and HtmlUnit like this. You really have to understand everything that is running on the site if you want to mimic it.


On 13/07/2017 at 17:36, Xue-Feng Yang wrote:
I ran more experiments on this issue and added the following:

webClient.getOptions().setUseInsecureSSL(true);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());

JavaScriptJobManager manager = htmlPage.getEnclosingWindow().getJobManager();
int count = 0;
while (manager.getJobCount() > 0) {
    System.out.println(count + "@" + manager.getJobCount());
    webClient.waitForBackgroundJavaScript(10000);
    count++;
}

Then I went to sleep. It has been running for a few hours. The job count went from 20 down to 3 and has stayed at 3.

Any thoughts?

Thanks
