nearestSentence Function

I wanted to be able to take a string of text, a blog entry in this case, and crop it down to the last complete sentence that fits within a given maximum length. I am sure this function exists in PHP, but I couldn’t find it, so I wrote my own and figured some of you might find it handy:

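function nearestSentence($s) {
    // Max length of string before
    // looking for last sentence.
    $maxL = 350;

    // Strip html tags and convert
    // html entities (like &amp;)
    // to single characters before counting.
    $toCount = strip_tags(html_entity_decode($s));

    // Crop the string down to the max length.
    // Always work on the stripped text, so that htmlentities()
    // below does not escape leftover tags or double-encode entities.
    if (strlen($toCount) <= $maxL) {
        $s2 = $toCount;
    } else {
        $s2 = substr($toCount, 0, $maxL);
    }

    // Look for position of the last ., ?, or !
    // strrpos() returns false when a mark is missing, so check each one.
    $lastPunct = -1;
    foreach (array('.', '?', '!') as $mark) {
        $pos = strrpos($s2, $mark);
        if ($pos !== false && $pos > $lastPunct) {
            $lastPunct = $pos;
        }
    }

    // Crop the string again down to the nearest punctuation.
    // If there is no punctuation at all, keep the cropped string.
    $s3 = ($lastPunct >= 0) ? substr($s2, 0, $lastPunct + 1) : $s2;

    // Return string with html entities re-inserted.
    return htmlentities($s3);
}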

So, if you had some text like:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat? Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis! At vero eros et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla?

Then, you would get:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat?

If you have some flexibility for the length of your blurb, I think this looks a lot better than just cropping your string down to a particular length and adding “...”, but maybe that's just me.

4 comments on “nearestSentence Function”

This is one of those times where regular expressions are more powerful than you expect. The following regexp is the key:

/^(.+?[.?!])/

This basically says: give me the shortest string that ends with dot, question mark or exclamation mark. The smart bit is the +?: it’s like the standard + (one or more characters) but it prefers shorter rather than longer strings. So, for example, whereas a regexp like /(.+b)/ will match all of “baaaabaaabaab”, a regexp like /(.+?b)/ will match just the “baaaab” part by default.
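To see the greedy and lazy behaviour side by side, here is a small sketch using preg_match. The sentence sample at the end is made up for illustration.

```php
<?php
// Greedy: .+ grabs as much as possible, then backs off to the last "b".
preg_match('/(.+b)/', 'baaaabaaabaab', $greedy);

// Lazy: .+? grows one character at a time and stops at the first "b" it reaches.
preg_match('/(.+?b)/', 'baaaabaaabaab', $lazy);

echo $greedy[1], "\n"; // baaaabaaabaab
echo $lazy[1], "\n";   // baaaab

// The same idea applied to sentences (illustrative sample text).
preg_match('/^(.+?[.?!])/', 'First one. Second one? Third!', $m);
echo $m[1], "\n";      // First one.
```

Note that "." does not match a newline by default, so on multi-line text the pattern would need the "s" modifier.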

So, the following code should do roughly the same as yours, except that it keeps only the first sentence rather than every sentence that fits under the limit:

function nearestSentence($s, $max = 350)
{
    if (preg_match('/^(.+?[.?!])/', $s, $matches)) {
        $s = $matches[1];
    }
    if ($max && strlen($s) > $max) {
        return substr($s, 0, $max - 3) . "...";
    }
    return $s;
}

This is off the top of my head, so I haven’t tested it, but it’s a start. Note the optional $max parameter; leave it out to default to a maximum length of 350, or set it to zero or null to have no maximum.

I threw in the “…” just for the hell of it, but I took out the to/from HTML stuff, which I think belongs in a separate routine. This way, the function works for any text, not just HTML.
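Because the lazy pattern stops at the first sentence end, a greedy bounded quantifier is one way to keep as many whole sentences as fit under the limit. This is only a sketch: the name nearestSentenceRegex is hypothetical, lengths are counted in bytes as with strlen, and PCRE caps the quantifier at 65535, so it assumes a modest $max.

```php
<?php
// Hypothetical helper, not from the original post or comment.
function nearestSentenceRegex($s, $max = 350)
{
    // No limit requested, or the text already fits: return it unchanged.
    if (!$max || strlen($s) <= $max) {
        return $s;
    }
    // Greedy {0,N} takes as many characters as allowed, then backtracks to
    // the last ., ? or ! inside the first $max bytes. The "s" modifier lets
    // "." match newlines too.
    $pattern = '/^.{0,' . ($max - 1) . '}[.?!]/s';
    if (preg_match($pattern, $s, $matches)) {
        return $matches[0];
    }
    // No sentence end within the limit: fall back to a hard cut with "...".
    return substr($s, 0, max(0, $max - 3)) . '...';
}

$text = 'One fish. Two fish? Red fish! Blue fish.';
echo nearestSentenceRegex($text, 25), "\n"; // One fish. Two fish?
```

The "..." is only added in the fallback case, where no complete sentence fits.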

Brian says:

Without checking Eric’s code, he hit the nail on the head. Regular expressions are powerful tools for parsing strings.

Also, you can strip html in another function and call it all on one line…

nearestSentence(strip_tags(html_entity_decode($s)), 350);

Your function should do one thing. Later on, when you have more complex code, keeping your functions limited to one task will make it easier to debug or reuse.
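As a rough sketch of that split, the two jobs can live in separate functions and be composed at the call site. Both htmlToPlainText and trimToSentence are hypothetical names; the trimming step here is a simple stand-in, not the exact code from the post.

```php
<?php
// Hypothetical helper: one job only, HTML in, plain text out.
// Uses the same call order as the one-liner above.
function htmlToPlainText($html)
{
    return strip_tags(html_entity_decode($html));
}

// Hypothetical stand-in for the trimming step: keeps the whole sentences
// that fit within $max bytes, and knows nothing about HTML.
function trimToSentence($text, $max = 350)
{
    $cut = substr($text, 0, $max);
    $last = -1;
    foreach (array('.', '?', '!') as $mark) {
        $pos = strrpos($cut, $mark);
        if ($pos !== false && $pos > $last) {
            $last = $pos;
        }
    }
    // With no punctuation at all, return the hard-cropped text.
    return $last >= 0 ? substr($cut, 0, $last + 1) : $cut;
}

$html = '<p>Fish &amp; chips. <b>Served hot!</b> Every day of the week</p>';
echo trimToSentence(htmlToPlainText($html), 40), "\n"; // Fish & chips. Served hot!
```

Each function can then be tested or reused on its own, for example trimToSentence on text that was never HTML.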

Eric, that’s great! I’ve been guilty of skirting around the use of regular expressions for a LONG time now. In fact, given a problem, I tend to find every conceivable solution BUT regular expressions… just ask the birdman, he knows.

My initial thought about stripping the HTML and removing the special characters was that, if you are employing a max value, this is essential to the function. I can see now how moving the HTML bit into a separate function would help keep the code compartmentalized. If HTML is being passed to this function, though, it is essential that you remove it first, unless you want to write another function to close the HTML tags left open by chopping the string off at the last sentence.

Nothing understood ))
