Showing posts with label Data cleaning. Show all posts
Showing posts with label Data cleaning. Show all posts

Thursday, December 4, 2014

My favorite SAS proc: proc freq

No doubt, my favorite proc step is proc freq. It is extremely simple, but therein lies its beauty: it's fast to run and easy to understand. With just two lines of code (almost zero effort), it delivers a set of frequency output for all variables in one printout that anyone can understand, regardless of their involvement with the actual data set. I've found raw proc freq output to be a terrific starting point for conversations on data cleaning or data exploration, no matter who is involved in the conversation:

PROC freq data=NAME;
run;

Friday, July 6, 2012

Sample "Where" Clauses for Finding Dates in Text Fields

The following where clauses use SAS PRXMatch functions and simple Perl regular expressions to help find dates within SAS dataset text fields.

Identify records that contain number/number, e.g. 1/2, 12/25, etc.:
where prxmatch('*\d\/\d*', fieldname) >= 1;

Identify records that contain number-number, e.g. 1-2, 12-25, etc.:
where prxmatch('*\d-\d*', fieldname) >= 1;

Identify records that contain four-digit numbers, e.g. 2007:
where prxmatch('*\d\d\d\d*', fieldname) >= 1;

Identify records that contain the word "July":
where prxmatch('*July*', fieldname) >= 1;

Or identify records that contain any of the above, any month name, or any month three-letter abbreviation:
where prxmatch('*\d\/\d*', fieldname) >= 1
OR prxmatch('*\d-\d*', fieldname) >= 1
OR prxmatch('*\d\d\d\d*', fieldname) >= 1
OR prxmatch('*January*', fieldname) >= 1
OR prxmatch('*Feburary*', fieldname) >= 1
OR prxmatch('*March*', fieldname) >= 1
OR prxmatch('*April*', fieldname) >= 1
OR prxmatch('*May*', fieldname) >= 1
OR prxmatch('*June*', fieldname) >= 1
OR prxmatch('*July*', fieldname) >= 1
OR prxmatch('*August*', fieldname) >= 1
OR prxmatch('*September*', fieldname) >= 1
OR prxmatch('*October*', fieldname) >= 1
OR prxmatch('*November*', fieldname) >= 1
OR prxmatch('*December*', fieldname) >= 1
OR prxmatch('*Jan*', fieldname) >= 1
OR prxmatch('*Feb*', fieldname) >= 1
OR prxmatch('*Mar*', fieldname) >= 1
OR prxmatch('*Apr*', fieldname) >= 1
OR prxmatch('*May*', fieldname) >= 1
OR prxmatch('*Jun*', fieldname) >= 1
OR prxmatch('*Jul*', fieldname) >= 1
OR prxmatch('*Aug*', fieldname) >= 1
OR prxmatch('*Sep*', fieldname) >= 1
OR prxmatch('*Oct*', fieldname) >= 1
OR prxmatch('*Nov*', fieldname) >= 1
OR prxmatch('*Dec*', fieldname) >= 1;


Or, a macro that will print any records that match the above:
%macro dsearch(tablename, fieldname);
    proc print data = &tablename;
        var &fieldname;
        where prxmatch('*\d\/\d*', &fieldname) >= 1
            OR prxmatch('*\d-\d*', &fieldname) >= 1
            OR prxmatch('*\d\d\d\d*', &fieldname) >= 1
            OR prxmatch('*January*', &fieldname) >= 1
            OR prxmatch('*Feburary*', &fieldname) >= 1
            OR prxmatch('*March*', &fieldname) >= 1
            OR prxmatch('*April*', &fieldname) >= 1
            OR prxmatch('*May*', &fieldname) >= 1
            OR prxmatch('*June*', &fieldname) >= 1
            OR prxmatch('*July*', &fieldname) >= 1
            OR prxmatch('*August*', &fieldname) >= 1
            OR prxmatch('*September*', &fieldname) >= 1
            OR prxmatch('*October*', &fieldname) >= 1
            OR prxmatch('*November*', &fieldname) >= 1
            OR prxmatch('*December*', &fieldname) >= 1
            OR prxmatch('*Jan*', &fieldname) >= 1
            OR prxmatch('*Feb*', &fieldname) >= 1
            OR prxmatch('*Mar*', &fieldname) >= 1
            OR prxmatch('*Apr*', &fieldname) >= 1
            OR prxmatch('*May*', &fieldname) >= 1
            OR prxmatch('*Jun*', &fieldname) >= 1
            OR prxmatch('*Jul*', &fieldname) >= 1
            OR prxmatch('*Aug*', &fieldname) >= 1
            OR prxmatch('*Sep*', &fieldname) >= 1
            OR prxmatch('*Oct*', &fieldname) >= 1
            OR prxmatch('*Nov*', &fieldname) >= 1
            OR prxmatch('*Dec*', &fieldname) >= 1
    ;
    run;
%mend dsearch;