



Getting specific information from a website



#1 remiX

  • Members
  • 2,076 posts
  • Location: South Africa

Posted 05 February 2013 - 07:09 AM

Hey guys, I need to get information from a website which looks like this:

Spoiler

There's quite a lot there and I need to get information from sources like this:

<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
                        <div class="linkURL" id="link_url_7016734">
                          pastebin.com/0qGFzhrU
                        </div>
                        <div class="linkURLFull" id="link_url_full_7016734" style="display:none">paste<wbr></wbr>bin.c<wbr></wbr>om/0q<wbr></wbr>GFzhr<wbr></wbr>U</div>
                      <p class="linkDescription">Test Description</p>
                        <div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>

I need:
1. Name - <a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a> - "Test"
2. Code - pastebin.com/0qGFzhrU (I only need the last part) - "0qGFzhrU"
3. Description - <p class="linkDescription">Test Description</p> - "Test Description"
4. Timestamp - Posted February 4, 2013 at 12:07 PM - "February 4, 2013 at 12:07 PM"

How can I achieve this? :X

#2 zekesonxx

  • Signature Abuser
  • 263 posts
  • Location: Where you aren't

Posted 05 February 2013 - 07:25 AM

I think the best way would probably be Regular Expressions.
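
Note that Lua has no built-in regular-expression library; its string functions use Lua patterns, which are similar but more limited. A minimal sketch of pulling the code and name out of one link with string.match, assuming the markup looks exactly like the snippet in the first post (the `sample` variable is a stand-in for the real page text):

```lua
-- Stand-in for one link from the page, copied from the snippet in post #1.
local sample = '<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>'

-- %. matches a literal dot; (%w+) captures the alphanumeric paste code;
-- [^>]* skips the remaining attributes; (.-) captures the link text up to </a>.
local code, name = string.match(sample, '<a href="http://pastebin%.com/(%w+)"[^>]*>(.-)</a>')

print(code, name)
```

Characters such as . - ( ) % are special in Lua patterns and need a % in front when they should match literally.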

#3 Cranium

    Ninja Scripter

  • Moderators
  • 4,031 posts
  • Location: Lincoln, Nebraska

Posted 05 February 2013 - 07:28 AM

You can use string.gmatch for this.
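local nameTable = {}
-- Lua patterns are case-sensitive, so the tag is lowercase to match the page source.
-- (.-) captures the link text, including names that contain spaces.
for name in string.gmatch(<URLFULLTEXT>, '<a href="http://pastebin%.com/.->(.-)</a>') do
	table.insert(nameTable, name)
end
That should match every link that points to a pastebin code and add its name to the table.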
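
As a self-contained sketch of the loop above, here it is run against a hard-coded string instead of a live page. The `pageText` variable and the second link (code `abc12345`) are made up for illustration; only the first link comes from the thread.

```lua
-- Hypothetical sample text standing in for the downloaded page.
local pageText = [[
<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
<a href="http://pastebin.com/abc12345" target="_blank">Second Program</a>
]]

local nameTable = {}
-- The first .- skips the rest of the opening tag; (.-) captures the link text.
for name in string.gmatch(pageText, '<a href="http://pastebin%.com/.->(.-)</a>') do
  table.insert(nameTable, name)
end

for index, name in ipairs(nameTable) do
  print(index, name)
end
```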

EDIT: Of course, <URLFULLTEXT> would be replaced with whatever http.get returns.

EDIT 2: Actually, you can take a look at how I matched strings with my SmartPaste program. It should be from lines 545 - 560 if you're interested.

Edited by Cranium, 05 February 2013 - 07:31 AM.


#4 remiX

  • Members
  • 2,076 posts
  • Location: South Africa

Posted 05 February 2013 - 07:40 AM

Yeah I've been messing around with string.gmatch but this is my first time using it so I'm kind of clueless!

I'm trying to get all four things into a table of tables:
t = {}

t[1] = {}

t[1].code = "First code it finds"
t[1].name = "First name it finds which has to match the code above"
t[1].desc = "First description it finds which matches the code/name"
t[1].timestamp = "The posted time"

That way it can easily be printed, put together, etc.

What I have made over the past 30 mins (yes, I know it's bad xD)

Spoiler

I know that you can use something like
for content in string.gmatch(urlText, "pastebin%.com/(%w+)") do print(content) end
I'm able to do that, but how do I then add each result to a table at the right index?
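
One possible way to get each match into its own slot, as a rough sketch: append a new sub-table for every code found. The `urlText` sample and the second code are invented for the example.

```lua
-- Hypothetical sample text; the second code is made up.
local urlText = [[
<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
<a href="http://pastebin.com/abc12345" target="_blank">Other</a>
]]

local t = {}
-- The trailing quote in the pattern limits matches to the href attribute.
for code in string.gmatch(urlText, 'pastebin%.com/(%w+)"') do
  -- #t + 1 is the next free index, so each match gets its own sub-table.
  t[#t + 1] = { code = code }
end

print(#t, t[1].code, t[2].code)
```

The other fields can then be filled in later with `t[i].name = ...` and so on.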

Going to take a look at your SmartPaste program now ...

#5 Cranium

    Ninja Scripter

  • Moderators
  • 4,031 posts
  • Location: Lincoln, Nebraska

Posted 05 February 2013 - 07:46 AM

My mistake, the lines I gave you were wrong. I meant to say starting around 685.

#6 remiX

  • Members
  • 2,076 posts
  • Location: South Africa

Posted 05 February 2013 - 08:06 AM

Cranium, on 05 February 2013 - 07:46 AM, said:

My mistake, the lines I gave you were wrong. I meant to say starting around 685.

Yeah, I used find to locate it... But getting specific information from pastebin is easier because everything is encased in <> and you're inserting everything into one table. Would that work for me? And then I could combine them at the end. I think it would, but I have no clue how xD

#7 Cranium

    Ninja Scripter

  • Moderators
  • 4,031 posts
  • Location: Lincoln, Nebraska

Posted 05 February 2013 - 08:10 AM

Well, for each variable you are trying to match, you are going to need a separate string.gmatch loop, but you can put them all in the same table under different keys. For example, response.name[1] would be the first name it finds. So it would be filled like this:

-- Each field needs its own sub-table before table.insert can be used on it.
response = { name = {}, code = {} }
-- "text" is the page source; "matchCommand" stands for the real pattern.
for name in string.gmatch(text, "matchCommand") do
	table.insert(response.name, name)
end
for code in string.gmatch(text, "matchCommand2") do
	table.insert(response.code, code)
end
It is a super simplified example, because I just don't want to have to write out the whole command.
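
For illustration, the same layout with concrete patterns might look like the sketch below. The patterns are guesses based on the HTML in the first post, and the sample `text` (including the second link) is hypothetical.

```lua
-- Hypothetical sample text; the second link is made up.
local text = [[
<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
<a href="http://pastebin.com/abc12345" target="_blank">Other</a>
]]

-- One list per field; entries at the same index belong together.
local response = { name = {}, code = {} }
for name in string.gmatch(text, '<a href="http://pastebin%.com/%w+"[^>]*>(.-)</a>') do
  table.insert(response.name, name)
end
for code in string.gmatch(text, '<a href="http://pastebin%.com/(%w+)"') do
  table.insert(response.code, code)
end

for i = 1, #response.code do
  print(response.code[i], response.name[i])
end
```

This only lines up if every link yields exactly one match in each loop.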

#8 remiX

  • Members
  • 2,076 posts
  • Location: South Africa

Posted 05 February 2013 - 08:15 AM

That won't work because there will be more than 1 code/name/description/title.

edit: misread, I'll check what I can do now...

edit2: Btw, forgot to ask: how do I get the full date?

Using string.gmatch(text, "Posted (.-)") only returns "February" from this line:

Posted February 4, 2013 at 12:07 PM
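
The likely cause is the lazy quantifier: (.-) matches as few characters as it can, so when nothing follows it in the pattern the capture stops short of the full date. A small sketch, assuming the date is followed by a newline (the `line` variable is a stand-in for the page text):

```lua
local line = "Posted February 4, 2013 at 12:07 PM\n"

-- Nothing follows the lazy capture, so it matches zero characters.
local tooShort = string.match(line, "Posted (.-)")

-- Anchoring on the newline forces the capture to run to the end of the line.
local fullDate = string.match(line, "Posted (.-)\n")

print("[" .. tooShort .. "]")
print("[" .. fullDate .. "]")
```

Giving the pattern something to stop at (a newline, a closing tag) is what makes the capture cover the whole date.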


#9 Cranium

    Ninja Scripter

  • Moderators
  • 4,031 posts
  • Location: Lincoln, Nebraska

Posted 05 February 2013 - 08:29 AM

You can do
string.gmatch(text, '<div class="linkTimestamp">(.-)</div>')
That should capture everything between those tags.

#10 remiX

  • Members
  • 2,076 posts
  • Location: South Africa

Posted 05 February 2013 - 08:40 AM

Yeah, but it has whitespace (newlines and indentation) inside the tags:

<div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>
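
One way around the whitespace, as a sketch: %s* matches any run of spaces, tabs and newlines, so the pattern does not have to reproduce the page's exact indentation. The `html` variable below is a stand-in for the page text.

```lua
local html = [[
<div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>
]]

-- %s* swallows the whitespace on either side of the text being captured.
local posted = string.match(html, '<div class="linkTimestamp">%s*Posted (.-)%s*</div>')
print(posted)
```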

Anyway, looks like I got it!

-- sourceText holds the page source fetched earlier
local t_Programs = {}

-- Pass 1: one entry per pastebin link, storing its code
local i = 1
for code in sourceText:gmatch('><a href="http://pastebin.com/(.-)" target="_blank">') do
	t_Programs[i] = {}
	t_Programs[i].code = code
	i = i + 1
end

-- Pass 2: find the link text that belongs to each code
for k = 1, #t_Programs do
	for title in sourceText:gmatch('<a href="http://pastebin.com/' .. t_Programs[k].code .. '" target="_blank">(.-)</a>') do
		t_Programs[k].name = title
	end
end

-- Pass 3: descriptions, assumed to appear in the same order as the links
i = 1
for desc in sourceText:gmatch('<p class="linkDescription">(.-)</p>') do
	t_Programs[i].desc = desc
	i = i + 1
end

-- Pass 4: timestamps; the whitespace in the pattern must match the page exactly
i = 1
for pDate in sourceText:gmatch([[<div class="linkTimestamp">
                          Posted (.-)
                        </div>]]) do
	t_Programs[i].postDate = pDate
	i = i + 1
end

-- Print every entry
for z = 1, #t_Programs do
	print(t_Programs[z].name .. " (" .. t_Programs[z].code .. ") - " .. t_Programs[z].postDate .. "\n")
	print(t_Programs[z].desc .. "\n")
end
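
The separate passes rely on every field appearing once per link and in the same order. A possible alternative, sketched below, captures all four fields with a single pattern so they stay together. It assumes each entry has the layout shown in the first post; the `programs` table and the shortened sample are made up for the example.

```lua
-- Hypothetical sample built from the snippet in the first post.
local sourceText = [[
<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
  <div class="linkURL" id="link_url_7016734">
    pastebin.com/0qGFzhrU
  </div>
  <p class="linkDescription">Test Description</p>
  <div class="linkTimestamp">
    Posted February 4, 2013 at 12:07 PM
  </div>
]]

-- One pattern, four captures: code, name, description, timestamp.
-- The .- pieces skip over the markup between the fields.
local pattern = '<a href="http://pastebin%.com/(%w+)"[^>]*>(.-)</a>'
  .. '.-<p class="linkDescription">(.-)</p>'
  .. '.-<div class="linkTimestamp">%s*Posted (.-)%s*</div>'

local programs = {}
for code, name, desc, postDate in sourceText:gmatch(pattern) do
  programs[#programs + 1] = { code = code, name = name, desc = desc, postDate = postDate }
end

for _, p in ipairs(programs) do
  print(p.name .. " (" .. p.code .. ") - " .. p.postDate)
  print(p.desc)
end
```

If an entry has no description, the .- would skip ahead to the next entry's description, so this sketch is only safe when every entry has all four fields.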

Thanks :P

#11 Cranium

    Ninja Scripter

  • Moderators
  • 4,031 posts
  • Location: Lincoln, Nebraska

Posted 05 February 2013 - 08:41 AM

Glad I could help!




