Gem: hpricot
284
vendor/gems/hpricot-0.6/README
vendored
Normal file
@@ -0,0 +1,284 @@
= Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like some JavaScript libs -- jQuery, Prototype -- give you). The XPath and CSS
parser, in fact, is based on John Resig's jQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used. If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out. You know, that sort of
thing.

*Please read this entire document* before making assumptions about how this
software works.

== An Overview

Let's clear up what Hpricot is.

# Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
# While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
  pays a small penalty in order to get that right. So that's slightly more important
  to me than speed.
# *If you can see it in Firefox, then Hpricot should parse it.* That's
  how it should be! Let me know the minute it's otherwise.
# Primarily, Hpricot is used for reading HTML and tries to sort out troubled
  HTML by having some idea of what good HTML is. Some people still like to use
  Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!

== The Hpricot Kingdom

First, here are all the links you need to know:

* http://code.whytheluckystiff.net/hpricot is the Hpricot wiki and bug tracker.
  Go there for news and recipes and patches. It's the center of activity.
* http://code.whytheluckystiff.net/svn/hpricot/trunk is the main Subversion
  repository for Hpricot. You can get the latest code there.
* http://code.whytheluckystiff.net/doc/hpricot is the home for the latest copy of
  this reference.
* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)

If you have any trouble, don't hesitate to contact the author. As always, I'm
not going to say "Use at your own risk" because I don't want this library to be
risky. If you trip on something, I'll share the liability by repairing things
as quickly as I can. Your responsibility is to report the inadequacies.

== Installing Hpricot

You may get the latest stable version from RubyForge. Win32 binaries and source
gems are available.

  $ gem install hpricot

As Hpricot is still under active development, you can also try the most recent
candidate build here:

  $ gem install hpricot --source http://code.whytheluckystiff.net

The development gem is usually in pretty good shape, actually. You can also
get the bleeding edge code or plain Ruby tarballs on the wiki.

== An Hpricot Showcase

We're going to run through a big pile of examples to get you jump-started.
Many of these examples are also found at
http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics, in case you
want to add some of your own.

=== Loading Hpricot Itself

You have probably got the gem, right? To load Hpricot:

  require 'rubygems'
  require 'hpricot'

If you've installed the plain source distribution, go ahead and just:

  require 'hpricot'

=== Load an HTML Page

The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
contents into a document object.

  doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

  doc = open("index.html") { |f| Hpricot(f) }

To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:

  require 'open-uri'
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.

=== Search for Elements

Use <tt>Doc.search</tt>:

  doc.search("//p[@class='posted']")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>

<tt>Doc.search</tt> can take an XPath or CSS expression. In the above example,
all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
attribute of <tt>"posted"</tt>.

A shortcut is to use the division operator (<tt>/</tt>):

  (doc/"p.posted")
  #=> #<Hpricot::Elements[{p ...}, {p ...}]>

=== Finding Just One Element

If you're looking for a single element, the <tt>at</tt> method will return the
first element matched by the expression. In this case, you'll get back the
element itself rather than the <tt>Hpricot::Elements</tt> array.

  doc.at("body")['onload']

The above code will find the body tag and give you back the <tt>onload</tt>
attribute. This is the most common reason to use the element directly: when
reading and writing HTML attributes.
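
One thing worth keeping in mind: <tt>at</tt> hands back <tt>nil</tt> when nothing matches, so chaining <tt>['onload']</tt> straight onto it raises a <tt>NoMethodError</tt> on a page without that element. Here is a minimal sketch of a guard. The helper name <tt>attribute_at</tt> is made up for illustration and is not part of Hpricot; it assumes only that its first argument responds to <tt>at</tt>, as an Hpricot document does.

```ruby
# Hypothetical helper (not part of Hpricot): read one attribute from the
# first element matching +selector+, or return nil when nothing matches.
# +doc+ can be anything that responds to #at, such as an Hpricot document.
def attribute_at(doc, selector, name)
  element = doc.at(selector) # nil when the selector matches nothing
  element ? element[name] : nil
end

# Usage, assuming +doc+ was loaded with Hpricot():
#   attribute_at(doc, "body", "onload")   # the attribute value, or nil
```

Returning <tt>nil</tt> for both "no such element" and "no such attribute" keeps the caller's check to a single test; split the two cases if you need to tell them apart.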

=== Fetching the Contents of an Element

Just as with browser scripting, the <tt>inner_html</tt> property can be used to
get the inner contents of an element.

  (doc/"#elementID").inner_html
  #=> "..<b>contents</b>.."

If your expression matches more than one element, you'll get back the contents
of <em>all the matched elements</em>. So you may want to use <tt>first</tt> to be
sure you get back only one.

  (doc/"#elementID").first.inner_html
  #=> "..<b>contents</b>.."
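
If you would rather keep each match's contents separate instead of picking just the first, you can walk the results yourself. A small sketch follows: <tt>inner_html_for_each</tt> is a hypothetical helper name, not an Hpricot method, and it leans only on <tt>search</tt> and <tt>inner_html</tt> as used in the examples above.

```ruby
# Hypothetical helper (not part of Hpricot): return an Array holding the
# inner HTML of every element matching +selector+, one String per match.
# +doc+ can be anything that responds to #search, such as an Hpricot document.
def inner_html_for_each(doc, selector)
  doc.search(selector).map { |element| element.inner_html }
end

# Usage, assuming +doc+ was loaded with Hpricot():
#   inner_html_for_each(doc, "p.posted")   # one String per matching paragraph
```

An expression that matches nothing gives back an empty array here, so there is no <tt>nil</tt> to check for.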

=== Fetching the HTML for an Element

If you want the HTML for the whole element (not just the contents), use
<tt>to_html</tt>:

  (doc/"#elementID").to_html
  #=> "<div id='elementID'>...</div>"

=== Looping

All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
through them like you would an array.

  (doc/"p/a/img").each do |img|
    puts img.attributes['class']
  end
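
Since the result behaves like an array, the usual Enumerable methods apply as well. The sketch below gathers one attribute from each match and drops the elements that lack it. <tt>attribute_values</tt> is an illustrative name, not something Hpricot provides, and it assumes each element exposes <tt>attributes</tt> as a hash, as in the loop above.

```ruby
# Hypothetical helper (not part of Hpricot): collect the value of attribute
# +name+ from each element in +elements+, skipping elements without it.
def attribute_values(elements, name)
  elements.map { |element| element.attributes[name] }.compact
end

# Usage, assuming +doc+ was loaded with Hpricot():
#   attribute_values((doc/"p/a/img"), "class")
```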

=== Continuing Searches

Searches can be continued from a collection of elements, in order to search deeper.

  # find all paragraphs.
  elements = doc.search("/html/body//p")
  # continue the search by finding any images within those paragraphs.
  (elements/"img")
  #=> #<Hpricot::Elements[{img ...}, {img ...}]>

Searches can also be continued by searching within container elements.

  # find all images within paragraphs.
  require 'pp'  # pp must be required explicitly on older Rubies
  doc.search("/html/body//p").each do |para|
    puts "== Found a paragraph =="
    pp para

    imgs = para.search("img")
    if imgs.any?
      puts "== Found #{imgs.length} images inside =="
    end
  end

Of course, the most succinct ways to do the above are using CSS or XPath.

  # the xpath version
  (doc/"/html/body//p//img")
  # the css version
  (doc/"html > body > p img")
  # ..or symbols work, too!
  (doc/:html/:body/:p/:img)

=== Looping Edits

You may certainly edit objects from within your search loops. Then, when you
spit out the HTML, the altered elements will show.

  (doc/"span.entryPermalink").each do |span|
    span.attributes['class'] = 'newLinks'
  end
  puts doc

This changes all <tt>span.entryPermalink</tt> elements to
<tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
of doing this, such as the <tt>set</tt> method:

  (doc/"span.entryPermalink").set(:class => 'newLinks')

=== Figuring Out Paths

Every element can tell you its unique path (either XPath or CSS) to get to the
element from the root tag.

The <tt>css_path</tt> method:

  doc.at("div > div:nth(1)").css_path
  #=> "div > div:nth(1)"
  doc.at("#header").css_path
  #=> "#header"

Or, the <tt>xpath</tt> method:

  doc.at("div > div:nth(1)").xpath
  #=> "/div/div:eq(1)"
  doc.at("#header").xpath
  #=> "//div[@id='header']"

== Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.

=== :fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible:
it makes sure to open and close all the tags, but ignores any validation problems.

As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.

  doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }

This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's
going to move the paragraph below the link, or up and out of other elements
where paragraphs don't belong.

If an unknown element is found, it is ignored. Again, that's with <tt>:fixup_tags</tt>.

=== :xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The
<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
1.0 Strict document, even at the cost of removing elements that get in the way.

  doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }

What measures does <tt>:xhtml_strict</tt> take?

1. Shift elements into their proper containers just like <tt>:fixup_tags</tt>.
2. Remove unknown elements.
3. Remove unknown attributes.
4. Remove illegal content.
5. Alter the doctype to XHTML 1.0 Strict.

=== Hpricot.XML()

The last option is the <tt>:xml</tt> option, which makes some slight variations
on the standard mode. The main difference is that <tt>:xml</tt> mode won't try to output
tags which are friendlier for browsers. For example, if an opening and closing
<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.

XML mode also doesn't downcase the tags and attributes for you. So pay attention
to case, friends.

The primary way to use Hpricot's XML mode is to call the <tt>Hpricot.XML</tt> method:

  doc = open("http://redhanded.hobix.com/index.xml") do |f|
    Hpricot.XML(f)
  end

*Also, :fixup_tags is canceled out by the :xml option.* This is because
<tt>:fixup_tags</tt> makes assumptions based on how HTML is structured. Specifically, how
tags are defined in the XHTML 1.0 DTD.