
Web Data Extraction - How do you do?


stupok
Veteran Member
Join Date: Feb 2006
Old 05-10-2008 , 11:14   Web Data Extraction - How do you do?
Reply With Quote #1

Sup bros.

My goal is to write a program that will copy some text from a specific website maybe once every day (the same one every time) so I can write it to a txt file, for example.

I found this program on the internets, but I can't find a programming tutorial for this. I'm sure one of you can push me in the right direction faster than I can labor my way through google results.

Thanks in advance.
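
For what it's worth, the "grab some text once a day and append it to a txt file" job could look roughly like the sketch below in JavaScript (Node 18 or newer, which has a global `fetch`). The names `formatEntry` and `saveSnapshot` and the `url`, `pattern`, and `outPath` parameters are made up for illustration. Running it daily is left to cron or Windows Task Scheduler.

```javascript
// Build one dated log line from the scraped text.
function formatEntry(text, when) {
  const day = when.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
  return `[${day}] ${text.trim()}\n`;
}

// Fetch the page, pull out the part matching `pattern`, and append it to a file.
// `url`, `pattern`, and `outPath` are placeholders supplied by the caller.
async function saveSnapshot(url, pattern, outPath) {
  const { appendFile } = await import("node:fs/promises");
  const response = await fetch(url); // global fetch: Node 18 or newer
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const html = await response.text();
  const match = pattern.exec(html);
  if (!match) throw new Error("pattern not found in page");
  await appendFile(outPath, formatEntry(match[0], new Date()));
}
```

A scheduler would then call something like `saveSnapshot(someUrl, /some pattern/, "log.txt")` once a day.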
micke1101
Veteran Member
Join Date: Jan 2008
Location: Banned-town
Old 05-10-2008 , 12:13   Re: Web Data Extraction - How do you do?
Reply With Quote #2

I would suggest you check this.
stupok
Veteran Member
Join Date: Feb 2006
Old 05-10-2008 , 14:34   Re: Web Data Extraction - How do you do?
Reply With Quote #3

Thanks, micke1101.

I found this:
http://www.codeproject.com/KB/vb/Get...from_USPS.aspx

I think that should do it for me, but I'll have to install some VB development software first. So far, the installer got stuck on step one, so I'll try again later. I was hoping for something for C++ or for AutoHotkey, but maybe that's not possible.
sawce
The null pointer exception error and virtual machine bug
Join Date: Oct 2004
Old 05-10-2008 , 15:40   Re: Web Data Extraction - How do you do?
Reply With Quote #4

You definitely can do it in C++, but it's not as forgiving as a dynamic language. Personally, if I were to do something like this, I would do it in either Python or Perl.
__________________
fyren sucks
sawce is offline
Xanimos
Veteran Member
Join Date: Apr 2005
Location: Florida
Old 05-12-2008 , 14:23   Re: Web Data Extraction - How do you do?
Reply With Quote #5

I've done this before in PHP, using curl + regex matching.
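
The fetch-then-match idea is not tied to PHP. Here is a minimal sketch of the regex half in JavaScript, assuming the page has already been downloaded into a string (by curl or anything else). The `extractFirst` helper and the sample markup are hypothetical and not taken from any real page.

```javascript
// Return the first capture group of `pattern` in `html`, or null if there is no match.
function extractFirst(html, pattern) {
  const match = pattern.exec(html);
  if (!match || match[1] === undefined) return null;
  return match[1].trim();
}

// Hypothetical page fragment; a real page would come from curl, fetch, etc.
const sampleHtml = '<div id="forecast"><p> WAVES 2 TO 4 FT. </p></div>';
const forecast = extractFirst(sampleHtml, /<div id="forecast">\s*<p>([\s\S]*?)<\/p>/);
console.log(forecast);
```

Regex matching like this is brittle if the page layout changes, so it works best on a page whose markup stays the same from day to day.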
stupok
Veteran Member
Join Date: Feb 2006
Old 05-12-2008 , 15:12   Re: Web Data Extraction - How do you do?
Reply With Quote #6

I've rediscovered Greasemonkey, a Firefox add-on, and it can do exactly what I was originally hoping to do. The only problem is that I've never touched JavaScript, so my progress is very slow. However, I'm definitely going to go with Greasemonkey as a long-term solution for my goal. The VB program is just too clumsy as a solution.

I found another add-on called Platypus that works on top of Greasemonkey. Platypus is supposed to allow you to automagically create Greasemonkey scripts just by interacting with a website with your mouse instead of writing code yourself. It's amazingly cumbersome, in my experience, and the script did not "save". That is, the script only ran once and never again.

But! I successfully modded the VB program I linked above to do exactly what I wanted it to do, and I learned a lot about regex in the process. I never realized that regex was so important.


Here are the details of my goal if you're interested:
I often go to this website for information about the height of the waves on the coast of Lake Michigan:
http://www.weather.com/outlook/recre...?zoneId=LMZ740

It pains me to decipher the condensed paragraph of all-caps text, so I want to use regex to format it nicely. I've made some good progress with it. Here's a screenshot of the new text:

[screenshot of the reformatted text]
Feel free to help if you're bored!
stupok
Veteran Member
Join Date: Feb 2006
Old 05-12-2008 , 21:18   Re: Web Data Extraction - How do you do?
Reply With Quote #7

Man, I'm starting to get frustrated with Platypus + Greasemonkey. I'm sure that Greasemonkey works just fine, but Platypus is horrid. None of the changes I make are saved.

I'm slowly figuring out how to piece together a script without using Platypus, but I'm using the code Platypus generates to guide me.

If there's a kind soul out there who has made Greasemonkey scripts in the past and doesn't mind helping me, I'd like to convert the regex and string manipulation commands below from Visual Basic to JavaScript. If you could get a fully functional script going, all the better.

Also, feel free to point out any mistakes I've made.

Code:
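    ' Belongs at the top of the file: Regex, Match and MatchCollection live here
    Imports System.Text.RegularExpressions

    Public Function GetForcast(ByVal str As String) As String
        Dim myMatch As Match
        Dim myMatches As MatchCollection

        ' Keep everything from the first "WINTHROP" onward
        myMatch = Regex.Match(str, "WINTHROP")
        str = str.Substring(myMatch.Index())

        ' Cut the text off at the first line break
        myMatch = Regex.Match(str, "\n")
        str = str.Substring(0, myMatch.Index())

        str = str.Replace("<P>", "")

        ' Skip ahead to the digit that follows the first hyphen
        myMatch = Regex.Match(str, "-\d")
        str = str.Substring(myMatch.Index() + 1)

        ' Turn runs of periods into line breaks and bold tags (longest runs first)
        str = str.Replace(". .", "..")
        str = str.Replace("....", "<br><br><b>")
        str = str.Replace("...", "</b><br>")
        str = str.Replace("..", "<br><br><b>")

        ' Insert ":" four characters before the first "M"
        ' (presumably turning a time such as 1015 AM into 10:15 AM)
        myMatch = Regex.Match(str, "M")
        str = str.Insert(myMatch.Index() - 4, ":")

        ' Wrap the result in the HTML shell
        str = str.Insert(0, "<html><center><b><FONT FACE=""Verdana"" SIZE=""1"">")
        str = str.Insert(str.Length, "</FONT></center></html>")

        ' Remaining single periods become line breaks
        str = str.Replace(".", "<br>")

        myMatches = Regex.Matches(str, "[A-Z]\d")

        ' Walk the matches backwards: each Insert shifts the text after it,
        ' so going front to back would leave the later indexes off by one
        For i As Integer = myMatches.Count - 1 To 0 Step -1
            str = str.Insert(myMatches(i).Index + 3, " ")
        Next

        Return str
    End Function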
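
A rough JavaScript port of the function above, as a sketch rather than a tested Greasemonkey script. Assumptions that are not in the original: the name `getForecast`, the `insertAt` helper (JavaScript strings have no `Insert`), the made-up `samplePage` text, and the guards for missing matches (the port returns `null` when "WINTHROP" is absent, where the VB version carries on or throws). The final loop walks the matches backwards so that earlier inserts do not shift the later positions.

```javascript
const WRAP_OPEN = '<html><center><b><FONT FACE="Verdana" SIZE="1">';
const WRAP_CLOSE = "</FONT></center></html>";

// Insert `text` into `s` at position `index`.
function insertAt(s, index, text) {
  return s.slice(0, index) + text + s.slice(index);
}

function getForecast(page) {
  // Keep everything from the first "WINTHROP" onward.
  const start = page.indexOf("WINTHROP");
  if (start === -1) return null;
  let str = page.slice(start);

  // Cut the text off at the first line break.
  const eol = str.indexOf("\n");
  if (eol !== -1) str = str.slice(0, eol);

  str = str.replace(/<P>/g, "");

  // Skip ahead to the digit that follows the first hyphen.
  const stamp = str.search(/-\d/);
  if (stamp !== -1) str = str.slice(stamp + 1);

  // Runs of periods become line breaks and bold tags, longest runs first.
  str = str
    .replace(/\. \./g, "..")
    .replace(/\.{4}/g, "<br><br><b>")
    .replace(/\.{3}/g, "</b><br>")
    .replace(/\.{2}/g, "<br><br><b>");

  // Insert ":" four characters before the first "M".
  const m = str.indexOf("M");
  if (m >= 4) str = insertAt(str, m - 4, ":");

  // Wrap in the HTML shell, then turn remaining periods into line breaks.
  str = (WRAP_OPEN + str + WRAP_CLOSE).replace(/\./g, "<br>");

  // Add a space one character after each "letter + digit" pair, back to front.
  const hits = [...str.matchAll(/[A-Z]\d/g)];
  for (let i = hits.length - 1; i >= 0; i--) {
    str = insertAt(str, Math.min(hits[i].index + 3, str.length), " ");
  }
  return str;
}

// Made-up sample in the general shape of a marine forecast; not real data.
const samplePage =
  "HEADER\n<P>WINTHROP HARBOR TO WILMETTE HARBOR-1015 AM CDT MON MAY 12 2008 " +
  "...SMALL CRAFT ADVISORY IN EFFECT...TODAY..NORTH WINDS 15 TO 25 KT. WAVES 3 TO 5 FT.\nFOOTER";
console.log(getForecast(samplePage));
```

In a Greasemonkey script the `<html>` wrapper is probably unnecessary, since the result would be assigned to the `innerHTML` of an element that already exists on the page.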

EDIT:
I finally fixed my Greasemonkey script and I have a pretty good result. The Visual Basic program, for some reason, let me control which text is bold and which is not with more fidelity, but I can't figure out how to get exactly the same result with the Greasemonkey script. Whatever, I did it. Hooray!
Last edited by stupok; 05-12-2008 at 21:51.