Friday, November 4, 2011

Importing Multiple Spreadsheets in SAS for Windows

What if you had to integrate data from a large number of spreadsheets scattered throughout a Windows file directory, including a number of subdirectories? Traversing a directory and collecting file paths might sound like a daunting proposal, but SAS can handle it. This post sketches out a possible solution in three steps.

Step one is to create a master list of all files in the directory, including all subdirectories:

filename DIRLIST pipe 'dir "C:\YOUR\FILE\LOCATION\" /s /b';
data dirlist;
     length buffer $256;
     infile DIRLIST length=reclen;
     input buffer $varying256. reclen;
run;

The above is a variation on SAS's solution to the problem, which is posted at http://support.sas.com/kb/24/820.html. The only difference in the above is that the MS DOS dir command contains /b, which allows the output table to populate with simple list of file paths. Read more about the dir command at http://www.computerhope.com/dirhlp.htm.

Step two is to create a subset of DIRLIST containing only the files to be accessed. If the desired files all have a common string in their file name, this PROC SQL statement will do the job:

PROC SQL;
CREATE TABLE myTable AS
SELECT * FROM DIRLIST WHERE buffer LIKE "%myString%";
QUIT;

If there is no common string in the file names, then your situation is more difficult. Nevertheless, chances are good that your data are all in one type of file, such as .xls, so you can create your subset of file paths based on whatever your extension may be:

PROC SQL;
CREATE TABLE myTable AS
SELECT * FROM DIRLIST WHERE buffer LIKE "%.xls%";
QUIT;

Step three is to loop through each one of the records in myTable, importing data from a single file on each pass. One way to create the outer loop is to adopt and adapt the SCANLOOP macro, which has been published at http://www2.sas.com/proceedings/sugi26/p093-26.pdf. A PROC IMPORT statement or DATA step (with INFILE statement) can be included within the SCANLOOP macro to pull in a file's data on each step - here is a possible example when each spreadsheet that you want to bring in contains two columns:

/* Filename statement to point to file to bring in */
/* LOC is macro variable containing file path */
filename MyImportFile "&LOC";
/* Import data into MyImportFile dataset */
DATA MyImportTEMP 
INFILE MyImportFile DLM=',|' MISSOVER DSD linesize=32767;
        /* Include needed number of variables */
LENGTH var1 $ 100 var2 $ 100;
INPUT var1 var2; 
RUN; 
/* Merge MyImportTEMP into MyImportALL dataset */
PROC DATASETS;
APPEND BASE=MyImportALL DATA=MyImportTEMP FORCE;
RUN; 

The above script will work when all of the files have some a common structure. If there is more than one type of file structure, one option would be to use one such PROC IMPORT or DATA step for each type, then use macro variables + IF/THEN logic to make sure that the correct one runs for each of your files.