Étoilé - Bugs: bug #8584, bad html parsing

 
 

bug #8584: bad html parsing

Submitted by:  Nicolas Roard <rio>
Submitted on:  Tue 27 Feb 2007 02:24:08 AM UTC  
 
Category: Grr / RSSKit
Severity: 3 - Normal
Priority: 1 - Later
Status: None
Privacy: Public
Assigned to: Guenther Noack <guenther>
Open/Closed: Open
Operating System: None

Wed 28 Feb 2007 02:43:09 AM UTC, comment #3:

I have a minimal part of hpricot under
/etoile/branches/yjchen/hpricot_scan.

It uses Ragel to generate code.
But once the code is generated, it is pure C without any dependencies,
so there is no problem using it.
It compiles slowly but runs fast.

Yen-Ju Chen <yjchen>
Project Member
Tue 27 Feb 2007 11:46:59 PM UTC, comment #2:

I had the problem on the étoilé blog feed :-)

The strange thing is, now that you mention it, I do remember the HTML parser code in Grr. I even remember it kinda working (apart from images). Yet the articles from this feed showed the HTML entities.
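
The thread does not establish why the entities were visible. One common cause in RSS feeds generally, offered here only as an assumption and not as a diagnosis of this feed, is markup that was escaped twice before publication, so that one round of unescaping still leaves entities in the text. A minimal Python sketch with a made-up description string:

```python
import html

# Hypothetical feed item whose HTML was escaped twice by the publisher.
raw_description = "&amp;lt;b&amp;gt;Release&amp;lt;/b&amp;gt; notes &amp;amp; fixes"

# First pass: roughly what an XML parser hands over for the element text.
once = html.unescape(raw_description)

# Second pass: real markup that an HTML parser could then interpret.
twice = html.unescape(once)

print(once)
print(twice)
```

If a feed behaves like this, the first string is what a reader would see as "HTML entities" in the article view.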

For hpricot, it's just an idea -- without having looked into it myself I won't say much more.

For the text loader idea in GNUstep, the architecture is not appropriate for, say, a web browser. But it's more than enough for an RSS viewer imho.

Nicolas Roard <rio>
Project Administrator
Tue 27 Feb 2007 10:13:16 PM UTC, comment #1:

Grr has an HTML parser. You can find it at Components/ArticleView/NSString+TolerantHTML.[mh]. When the HTML parser encounters an error that causes it to throw an exception, the article view falls back to plain text display. Could you please give me a copy of the feed that caused problems with HTML parsing?
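
The fallback described above can be sketched as follows. This is a Python illustration only, not the Grr code: StrictTagChecker and render_article are hypothetical names, and the "error" here is simply a mismatched or unclosed tag.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in this sketch.
VOID_TAGS = {"br", "img"}


class StrictTagChecker(HTMLParser):
    """Hypothetical stand-in for a parser that raises on malformed markup."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.open_tags or self.open_tags.pop() != tag:
            raise ValueError(f"unexpected closing tag </{tag}>")

    def handle_data(self, data):
        self.text.append(data)


def render_article(source):
    """Return ('html', text) on success, ('plain', source) on a parse error."""
    checker = StrictTagChecker()
    try:
        checker.feed(source)
        checker.close()
        if checker.open_tags:
            raise ValueError(f"unclosed tag <{checker.open_tags[-1]}>")
    except ValueError:
        # Fall back to showing the article source unchanged.
        return ("plain", source)
    return ("html", "".join(checker.text))
```

With this shape, a feed that triggers the fallback shows its raw markup, which is one reason a sample of the failing feed is useful for debugging.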

Concerning hpricot:

I'll look into whether hpricot makes sense for the article view component. The current HTML parsing code can be roughly divided into two parts: the parser itself (which reads in the tags and the text) and the interpreter (which translates the tags into an attributed string). I hope to be able to replace the parser with hpricot here.
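
To illustrate the parser/interpreter split, here is a rough Python sketch. It is not the Grr implementation: the names Tokenizer, parse, interpret and STYLES are made up, and a list of (text, styles) runs stands in for an attributed string.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Parser stage: reads tags and text into a flat token list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def parse(source):
    tokenizer = Tokenizer()
    tokenizer.feed(source)
    tokenizer.close()
    return tokenizer.tokens


# Hypothetical mapping from tag names to style attributes.
STYLES = {"b": "bold", "i": "italic"}


def interpret(tokens):
    """Interpreter stage: turns tokens into (text, frozenset of styles) runs."""
    active = []
    runs = []
    for kind, value in tokens:
        if kind == "text":
            runs.append((value, frozenset(active)))
        elif kind == "open" and value == "br":
            runs.append(("\n", frozenset()))
        elif kind == "open" and value in STYLES:
            active.append(STYLES[value])
        elif kind == "close" and value in STYLES and STYLES[value] in active:
            active.remove(STYLES[value])
    return runs
```

Because interpret only sees the token list, the parser stage could in principle be swapped for another one (such as hpricot) that produces the same tokens.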

Concerning the GNUstep text loader idea:

The last time I looked at it, the text loader architecture did not seem to me to fit seamlessly into the Grr HTML parsing code. But maybe I'm wrong there.

Another thing that comes to my mind concerning the text loader is that GNUstep has a pretty strict no-external-dependencies policy,
which means that hpricot is a no-no unless it is small enough to be put into GNUstep as well.

Guenther Noack <guenther>
Project Member
In charge of this item.
Tue 27 Feb 2007 02:24:08 AM UTC, original submission:

Grr apparently has no HTML parser, so feeds aren't rendered properly. Ideally it should support at least basic HTML tags (i, b, br, ul, li, ol, img, url). A possible solution is to use hpricot; and if we have an HTML parser that outputs an NSAttributedString, we could move that to a GNUstep text loader :) (i.e., fix that "bug" upstream).

Nicolas Roard <rio>
Project Administrator

 

No files currently attached

 

Depends on the following items: None found

Items that depend on this one: None found

 

Carbon-Copy List
  • -unavailable- added by yjchen (Posted a comment)
  • -unavailable- added by rio (Submitted the item)


    Follow 5 latest changes.

    Date                             Changed By  Updated Field  Previous Value => Replaced By
    Sat 01 Nov 2008 07:28:04 PM UTC  guenther    Status         Wont Fix => None
    Sat 01 Nov 2008 07:27:50 PM UTC  guenther    Status         In Progress => Wont Fix
    Thu 01 Mar 2007 09:44:25 AM UTC  guenther    Status         Need Info => In Progress
    Wed 28 Feb 2007 12:19:20 PM UTC  rio         Summary        No html parser => bad html parsing
    Tue 27 Feb 2007 10:13:16 PM UTC  guenther    Status         None => Need Info