trinity-users@lists.pearsoncomputing.net

Re: [trinity-users] Hoping I can find a regex expert here

From: Gene Heskett <gheskett@...>
Date: Wed, 23 Mar 2016 09:35:31 -0400

On Wednesday 23 March 2016 07:22:03 E. Liddell wrote:

> On Wed, 23 Mar 2016 15:58:39 +0900
>
> Michele Calgaro <michele.calgaro@...> wrote:
> > On 2016/03/23 02:19 PM, Gene Heskett wrote:
> > > On Wednesday 23 March 2016 00:32:17 Michele Calgaro wrote:
> > >> On 2016/03/23 12:44 PM, Gene Heskett wrote:
> > >>> Greetings;
> > >>>
> > >>> I use mailfilter as a prefilter in front of fetchmail to nuke
> > >>> some spam while it's still on the server.
> > >>>
> > >>> But it's missing hits on what I suspect is the From: or
> > >>> Return-Path: strings that have quotation marks in the string
> > >>> because the string is being spec'd by being surrounded by "show
> > >>> this name" bs.
> > >>>
> > >>> I've added the character < as part of the string it's to search
> > >>> for, so the search string now looks like
> > >>> "From:.*<*\.unwanted-tld".  Does this stand that famous
> > >>> snowball's chance in hell of working well with or without a
> > >>> quoted "some funkity name" in front of the real url with the <>
> > >>> around it?
> > >>>
> > >>> I just love the lack of documentation on how this string
> > >>> comparison stuff works as shown by the man pages for grep and
> > >>> regex.  All sorts of control options are well covered, but
> > >>> figuring out how to write a search expression must be one of
> > >>> the world's better guarded secrets.
> > >>>
> > >>> So if someone could show me, or give a url that actually has the
> > >>> full docs, I'd be grateful.
> > >>>
> > >>> Thanks.
> > >>>
> > >>> Cheers, Gene Heskett
> > >>
> > >> Hi Gene,
> > >> "From:.*<*\.unwanted-tld" will match a string like this (I have
> > >> put one section per line to be clearer):
> > >> From:
> > >> zero or more of any character
> > >> 0 or more <
> > >> .unwanted-tld
> > >
> > > I thought I wanted 1 only, but the way these lowlifes change
> > > addresses and names hourly, they may remove the <> surrounding the
> > > real source address and screw me up.  But the fact that they often
> > > put double quotes around the throwaway part of the url, is I think
> > > screwing me regardless.
> > >
> > > What we need is the ability to specify the quote character by the
> > > first non-space character after the DENY =, which is currently a
> > > "^ or a <> which apparently inverts the logic.  So a typical line
> > > would be
> > >
> > > DENY = "^From:.*<*\.bid"
> > >
> > > Substitute any of the new tld's for bid that gets obnoxious.  Like
> > > xyz, or .pro, heck that new list is several dozen tld's.
> > >
> > > But AFAIK, we're stuck with the dblquote wrapper around the string
> > > to match.  Grrrr.
> > >
> > >> It is greedy, so it will scan until the last < if there are more
> > >> than one. Not sure if this is what you need or not. If you can
> > >> post an example of what you need to match, I can work out another
> > >> regex if required.
> > >
> > > Try this:
> > >
> > > "-Bed Bugs-" <-BedBugs-@...>
> > >
> > > with Return-Path.* or From.* in front of it.  Or does that - sign,
> > > 4 of them, need escaping with a \ ? IDK.
>
> Hyphens should only need an escape if within a character class,
> denoted by square brackets.
>
> > > I converted about 3 lines of the filterdata file that way, and I'm
> > > now waiting for the next blast of spam to serve as test data. 
> > > mailfilter is a picky twit, but that hasn't given it a tummy ache
> > > either, so I am hopeful.
> > >
> > >> PS: by the way, the internet is full of excellent documentation
> > >> about regex ;-) For example
> > >> "http://www.regular-expressions.info/"
> > >
> > > Cheers, Gene Heskett
> >
> > Hi Gene,
> > so if I understand correctly, you already had a set of rules like
> > DENY = "^From:.*\.bid"  (bid stands for any tld of your choice)
> > but it was missing some entries because of the "..." entry before
> > the domain. So you put the < in the string as well.
> > Right?
> >
> > Assuming so, it surprises me that the original version missed some
> > entries, since the additional "..." field would have already been
> > matched by the .* part of the pattern.
> > I think there is a different reason for missing entries. Perhaps a
> > blank character before "From:"? Could it be? You could try this
> > other version:
> > DENY = "^\s*From:.*\.bid"  which ignores any separator before From:
>
> That would also sweep up, say, fred@..., or
> "I.bid" <ibid@...>
>
> > or
> > DENY = "^\s*From:.*\.bid>" which also makes explicit that the tld is
> > followed by a >.
>
> I'd cover the example as
>
> ^\W*((From:)|(Return-Path:)).*\.bid\W*$
>
> which works out to zero or more non-word characters  at the beginning
> of the string, followed by "From:" or "Return-Path:" followed by zero
> or more unknowns, followed by ".bid", followed by zero or more
> non-word characters, followed by the end of the string.  "Word"
> characters are alphanumerics plus the underscore (not the hyphen), and possibly
> some non-ASCII depending on the implementation, so "non-word" covers
> stuff like punctuation and whitespace.  Marking the end of the string
> makes it more likely you're getting the TLD and not some random bit in
> the middle that was designed as a parser torture-test.
>
> If you want to get really silly,
>
> ^\W*((From:)|(Return-Path:)).*\.[^cCoOnN][a-zA-Z][a-zA-Z]+\W*$
>
> ought to catch the majority of TLDs with a 3+ ASCII character
> extension that isn't .com, .org, or .net, but without a larger sample
> of "good" and "bad" addresses, I can't guarantee no false positives.
>
> I write a lot of regexes in my day job (which is not to say that I get
> them right the first time, every time!)  Assuming a Perl-compatible
> implementation (which most of them are, more or less), "man perlre" is
> a decent reference for the complicated bits.  Just scroll past the
> section on modifiers.
>
> E. Liddell

Now that looks like the regex bible, Thanks a bunch.  That needs printed 
and placed in the middle of the house little room. :)
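
For anyone who wants to try these before the next blast of spam arrives, here
is a minimal sketch that runs the patterns from this thread against a few
sample header lines. It uses Python's re module as a stand-in, so it is an
assumption that mailfilter's regex engine treats \W and friends the same way.
The header lines and the example.* domains are made up for illustration; the
real addresses were elided above.

```python
import re

# Patterns quoted from the thread.
LOOSE = re.compile(r"^From:.*<*\.bid")
ANCHORED = re.compile(r"^\W*((From:)|(Return-Path:)).*\.bid\W*$")
NOT_COM_ORG_NET = re.compile(
    r"^\W*((From:)|(Return-Path:)).*\.[^cCoOnN][a-zA-Z][a-zA-Z]+\W*$"
)

# Hypothetical header lines (domains invented for this sketch).
headers = {
    "spam_quoted": 'From: "-Bed Bugs-" <-BedBugs-@example.bid>',
    "spam_bare": "Return-Path: <bounce@example.bid>",
    "ham_display": 'From: "I.bid" <ibid@example.com>',
    "ham_plain": "From: fred@example.org",
    "spam_club": "From: <promo@example.club>",
}


def hits(pattern):
    """Return the sorted names of the sample headers the pattern matches."""
    return sorted(name for name, line in headers.items() if pattern.search(line))


for label, pattern in [
    ("loose", LOOSE),
    ("anchored", ANCHORED),
    ("not com/org/net", NOT_COM_ORG_NET),
]:
    print(f"{label}: {hits(pattern)}")
```

With these samples the loose pattern also catches the "I.bid" display name on
a .com address and misses the Return-Path line, while the anchored pattern
catches only the two .bid senders. The last pattern also lets .club through,
since it skips every TLD that starts with c, o, or n, not just com, org, and
net.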


Cheers, Gene Heskett
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>