udim ([info]udim) wrote,
@ 2006-11-06 12:28:00
Tiny Mix Tapes to ATOM
The ATOM feed: http://stuff.pulkes.org/tmt2atom.php
php source

So I wasted a Saturday creating a website-to-RSS PHP script for sites that don't have RSS. Along the way I went back and forth between using the XML parser, writing my own HTML parser, and looking for an already written HTML parser:

  • Trying to parse HTML as XML doesn't work, even if you strip most of the tags and add a dummy enclosing tag. XML parsers are just too anal (at least PHP's) and most HTML is buggy (unescaped &'s, for instance).

  • HTML Tidy wasn't compiled in with DreamHost's PHP, and when I tried rolling my own I found they didn't have libtidy installed, so I gave up on it.

  • Writing my own parser, I couldn't shake the nagging feeling that I was reinventing the wheel. Also, it didn't take long (only a couple of hours, but I have MANY hours to spare) to reach the first hurdle: PHP is SLOW! And then I remembered that PHP's XML parser uses libexpat, which is written in C, and it all went downhill from there.
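
The first bullet is easy to reproduce. Here is a minimal sketch in Python rather than PHP (Python's xml.etree.ElementTree is also built on expat); the sample markup and the helper name parses_as_xml are made up for illustration and are not taken from tmt2atom.php:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet of typical real-world HTML: the "&" is not escaped.
buggy_html = '<p>Reviews & News</p>'
# The same snippet with the ampersand escaped the way XML requires.
clean_html = '<p>Reviews &amp; News</p>'


def parses_as_xml(fragment):
    """Return True if the fragment survives a strict XML parse."""
    # Wrap the fragment in a dummy enclosing tag, as described above.
    wrapped = '<root>' + fragment + '</root>'
    try:
        ET.fromstring(wrapped)
        return True
    except ET.ParseError:
        # A bare "&" is enough to make the whole document ill-formed.
        return False


print(parses_as_xml(buggy_html))
print(parses_as_xml(clean_html))
```

A single unescaped ampersand anywhere on the page is enough to abort the parse, which is why stripping tags and adding a wrapper does not help.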

I saw the light when I gave up on making a general script that could fit any website and resorted to old-fashioned regular expressions. Much faster, and it still ended up being a general script.
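
Roughly, the regular-expression approach looks like the sketch below. It is in Python, not the PHP of the actual script, and the sample markup, the pattern, the base URL, and the function name page_to_entries are all assumptions for illustration; the real Tiny Mix Tapes pages and the real script may look quite different.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical markup standing in for a scraped page.
SAMPLE_PAGE = '''
<div class="item"><a href="/reviews/one">First & foremost</a></div>
<div class="item"><a href="/reviews/two">Second review</a></div>
'''

# One pattern per site: capture the link target and the link text.
ITEM_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def page_to_entries(html, base_url):
    """Return a list of Atom <entry> fragments scraped from html."""
    entries = []
    for href, title in ITEM_RE.findall(html):
        link = base_url + href
        # Escape the scraped text so the output is well-formed XML even
        # when the source HTML contains a bare "&".
        entries.append(
            '<entry><title>%s</title><link href=%s/><id>%s</id></entry>'
            % (escape(title), quoteattr(link), escape(link))
        )
    return entries


for entry in page_to_entries(SAMPLE_PAGE, 'http://example.com'):
    print(entry)
```

Only the pattern and the base URL change from site to site, which is how a pile of regexps can still end up being a general script. The sketch leaves out elements a valid Atom feed also needs, such as <updated>, and it would double-escape text that is already escaped in the source page.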

Now I just have to wait for TMT to update to see if it really works.

Update: 1. It works. 2. JWZ already did this a long time ago.

keywords: atom, rss, feed, tinymixtapes, tiny mix tapes, tmt2atom




XML is just fucked up, XML-parsing libraries more so.
[info]shayel
2006-11-06 11:40 am UTC (link)
XML has become everyone's favorite nail. That's the problem with programming trends, like OOP or Hungarian Notation (which is a great example of a good thing gone terribly wrong) – they get adopted quickly, but most of the adopters don't understand why they should be used, and thus don't get the how either.

Whenever I need to get something useful out of XML, I just use regexps. I figure it's much easier to just write a simple regexp every time than to learn and debug an XML parser.


Re: XML is just fucked up, XML-parsing libraries more so.
[info]udim
2006-11-06 08:42 pm UTC (link)
I really don't know: when is XML a good choice? Cross-platform interoperability? Backwards compatibility? Other?

For example: I had a choice of RSS and ATOM formats and I chose ATOM, quite haphazardly, because it was newer and better standardized and it's moving forward (or so they said).
I can now safely back up that decision, having experienced some of the horrors of parsing a particular plaintext-based format. Specifically, it's the need to distinguish between directives and the actual text when they are written in the same alphabet! The DNS system is a well-executed example: domain names don't have slashes in them. HTML is not: escaping certain characters, such as when a literal '<' becomes '&lt;', is not a task for mere mortals (and making that sentence display correctly took me several attempts).
Why don't we just use a format that uses nulls or some other non-typeable character to differentiate between text and directives? I don't know. At least for now, we have XML (XHTML) and its grammatical zeal to protect us.
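
To make the escaping trouble concrete, here is a small Python sketch using the standard html module (the sample string is made up). Escaping once turns the markup characters into text; escaping a second time is the classic way to get it wrong:

```python
import html

literal = "if a < b && b > c"

# Escape once so the markup characters are treated as text, not directives.
escaped = html.escape(literal)
print(escaped)  # if a &lt; b &amp;&amp; b &gt; c

# Escape twice and the "&" of every "&lt;" gets escaped again.
double_escaped = html.escape(escaped)
print(double_escaped)

# Unescaping recovers the original only if it was escaped exactly once.
print(html.unescape(escaped) == literal)
print(html.unescape(double_escaped) == literal)
```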
