How to iterate over a Hierarchical file structure in C (more or less...)

In which I automate bulk database entry for files in many different (although well-organised) locations.
The Problem:
In amassing a data set for this project, I gathered a large number of image files clipped from various sources. To keep them all nice and organized, each one was kept on disk in its appropriate spot within a set of hierarchically organized folders.
That is, until one day when I tried to count them. I can’t say for certain, but it seems they must have been breeding... In any event, it was time to find a better way to keep track of them. Hence, I decided to dump them into a relational database!
Since there were so many of them, there was no way I wanted to enter each one by hand. That’s when I came up with the clever plan to write a small program to loop over my file structure, extract the relevant information, and drop each image into a simple, little table. After all, how hard could that be?
Well, actually...quite hard.
The Solution:
To elaborate: I wound up compiling as C++ and relying on the Windows file management functions documented on MSDN, https://msdn.microsoft.com/en-us/library/windows/desktop/aa364232(v=vs.85).aspx, since the standard C library offers no functions for file traversal. Also for reference, I was working in Windows, using the CodeBlocks IDE as well as MariaDB (an open source drop-in replacement for MySQL) for my database.
The Explanation:
Although I hope my program is commented well enough, and it makes sense (more or less) to me, here’s a small example that I hope clarifies its major functionality.
As an example, take a bunch of pictures of different types of fruit. On disk, they might be stored as follows:
Here, the top level of the structure separates the images into major fruit types (apple, orange, etc.), the second level further subdivides them by varietal, and images are stored at the bottom of this structure. Therefore, all the relevant information (in this example, fruit type and varietal) can be determined for each image just by walking back up the tree.
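To make the "walking back up the tree" idea concrete, here is a minimal sketch using C++17 std::filesystem for the path handling. The FruitLabels struct, the labelsFromPath function, and the example folder names are hypothetical and only illustrate the layout described above; they are not taken from the actual program.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical record: the field names are illustrative only.
struct FruitLabels {
    std::string fruitType;  // level 1 folder name
    std::string varietal;   // level 2 folder name
    std::string fileName;   // image file name
};

// Recover an image's labels by walking back up from the file.
// Assumes the layout <base>/<fruit type>/<varietal>/<image>.
FruitLabels labelsFromPath(const fs::path& image) {
    FruitLabels out;
    out.fileName = image.filename().string();
    const fs::path varietalDir = image.parent_path();
    out.varietal = varietalDir.filename().string();
    out.fruitType = varietalDir.parent_path().filename().string();
    return out;
}
```

For a path such as fruit/apple/gala/img001.png, this would give "apple" as the fruit type and "gala" as the varietal. Only the path string is inspected; nothing is read from disk.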
Now for the code itself:
The top comment of my code contains the instructions for making CodeBlocks play nicely with the MySQL C connector (assuming that you already have one set up), and underneath that are all the necessary libraries (in an order that makes things not blow up!). The rest of the code, however, you will likely want to modify to suit your own purposes.
I first set up a structure to temporarily store an image’s information in such a way that it corresponds to the columns of the database.
    /* Struct to collect file attributes for entry in SQL table */
    struct imgAttributes
    {
        char col1[5] = {0};
        char col2[50] = {0};
        char col3[100] = {0};
        char col4[10] = {0};
        char col5[50] = {0};
        char fullPath[MAX_PATH] = {0};
    };
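Because each column is a fixed-size char array, anything copied into it has to be truncated to fit. The following is a rough sketch of one way to do that, assuming a cut-down two-column record; ImgRecord and copyField are hypothetical names, not part of the original code.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical two-column record in the style of imgAttributes.
struct ImgRecord {
    char col1[5] = {0};
    char col2[50] = {0};
};

// Copy src into a fixed-size field, truncating if necessary and always
// leaving room for the terminating '\0'.
template <std::size_t N>
void copyField(char (&dst)[N], const std::string& src) {
    const std::size_t n = src.size() < N - 1 ? src.size() : N - 1;
    std::memcpy(dst, src.data(), n);
    dst[n] = '\0';
}
```

Assigning a fresh value (rec = ImgRecord{};) zeros every field again, which is one way to reset the record before moving on to the next image.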
Within the main() function, I first set up three path variables: basePath, which points to the location of the “Full Database” and is never altered anywhere in the code; currentPath, which tracks the current position within the file structure; and tempPath, which is identical to currentPath but with a trailing backslash and “*” wildcard appended so that it can be used with the file management functions.
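As a rough illustration of how these three paths relate, here is a sketch that uses std::string in place of the fixed MAX_PATH buffers. The helper names joinPath and makeSearchPattern are hypothetical, and the assumption is that the search functions in question (such as FindFirstFile) take a wildcard pattern rather than a bare folder name.

```cpp
#include <string>

// Join a directory and a child name with a Windows-style backslash.
std::string joinPath(const std::string& dir, const std::string& child) {
    return dir + "\\" + child;
}

// Build the wildcard pattern used to enumerate everything inside dir,
// i.e. the role played by tempPath.
std::string makeSearchPattern(const std::string& dir) {
    return dir + "\\*";
}
```

With a hypothetical basePath of C:\Full Database, joining "apple" gives a currentPath of C:\Full Database\apple, and the matching search pattern is C:\Full Database\apple\*.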
In the next chunk, I made lists of all the folders at level 1 (i.e. “fruit type”) and level 2 (i.e. “varietal”) of the hierarchy. For my purposes - and NOT like the above fruit example - the level 2 directories were the same within each level 1 folder, so I just set them in advance. If this isn’t the case for you, that step should be moved down into the main loop so that the Lvl2Array can be repopulated for each different level 1 directory.
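If your level 2 folders do differ between level 1 folders, the relisting could look something like the sketch below. Note that it uses C++17 std::filesystem rather than the Windows Find functions, so it is a swapped-in technique, not the approach taken in the program described here; listSubdirs and listLevelPairs are hypothetical names.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Return the names of the immediate subdirectories of dir, sorted.
// directory_iterator skips "." and "..", so no "- 2" adjustment is needed.
std::vector<std::string> listSubdirs(const fs::path& dir) {
    std::vector<std::string> names;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        if (entry.is_directory()) {
            names.push_back(entry.path().filename().string());
        }
    }
    std::sort(names.begin(), names.end());
    return names;
}

// Visit every level 1 / level 2 pair under base, relisting the level 2
// folders separately for each level 1 folder.
std::vector<std::string> listLevelPairs(const fs::path& base) {
    std::vector<std::string> pairs;
    for (const std::string& lvl1 : listSubdirs(base)) {
        for (const std::string& lvl2 : listSubdirs(base / lvl1)) {
            pairs.push_back(lvl1 + "/" + lvl2);
        }
    }
    return pairs;
}
```

The point of the inner call is simply that the level 2 list is rebuilt inside the loop instead of being fixed in advance.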
Also, note that any call to a Find or Count Files function will return all the files and folders in that location plus two additional entries, “.” (current location) and “..” (up a level). That is why 2 is subtracted from each count when the arrays are sized below.
    // Find number of folders (Level 1) in the Full Database directory;
    // the count includes "." (current location) and ".." (up a level)
    int numLvl1Dirs = 0;
    numLvl1Dirs = countFiles(tempPath, hFind, &ffd);
    char *Lvl1Array[numLvl1Dirs - 2] = {0};
    populateArray(tempPath, Lvl1Array, numLvl1Dirs, hFind, &ffd);

    // Change to first Level 1 directory
    setPath(currentPath, basePath, MAX_PATH, Lvl1Array[0]);
    StringCchCopy(tempPath, MAX_PATH, currentPath);
    StringCchCat(tempPath, MAX_PATH, TEXT("\\*"));

    // Find number of folders in its subdirectory (Level 2)
    int numLvl2Dirs = 0;
    numLvl2Dirs = countFiles(tempPath, hFind, &ffd);
    char *Lvl2Array[numLvl2Dirs - 2] = {0};
    populateArray(tempPath, Lvl2Array, numLvl2Dirs, hFind, &ffd);
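The "minus 2" bookkeeping can also be expressed as an explicit filter. Here is a small sketch, assuming the raw listing has already been collected into a vector of names; isDotEntry and dropDotEntries are hypothetical helpers, not functions from the original program.

```cpp
#include <string>
#include <vector>

// True for the two pseudo-entries that show up in a raw listing.
bool isDotEntry(const std::string& name) {
    return name == "." || name == "..";
}

// Keep only the real file and folder names from a raw listing.
std::vector<std::string> dropDotEntries(const std::vector<std::string>& raw) {
    std::vector<std::string> kept;
    for (const std::string& name : raw) {
        if (!isDotEntry(name)) {
            kept.push_back(name);
        }
    }
    return kept;
}
```

Filtering by name rather than subtracting a fixed 2 avoids an off-by-two error if a listing ever comes back without the dot entries.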
At that point, the MySQL connection is set up, and the path is reset to the base path to prepare for looping down to each image file. Since this program anticipates being run multiple times on the same file structure after extra images have been added, it is set up to check the relational database for duplicates and skip those files; the numDBmatches variable tracks how many duplicates were found.
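The skip-if-already-stored logic can be sketched as follows. This is only an illustration: an in-memory set stands in for the lookup against the database table, and ScanResult and skipDuplicates are hypothetical names. Only numDBmatches comes from the original program.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical result of one pass over the discovered files.
struct ScanResult {
    std::vector<std::string> toInsert;  // paths not yet in the database
    int numDBmatches = 0;               // paths skipped as duplicates
};

// alreadyStored stands in for a query against the database table.
ScanResult skipDuplicates(const std::vector<std::string>& paths,
                          const std::unordered_set<std::string>& alreadyStored) {
    ScanResult result;
    for (const std::string& p : paths) {
        if (alreadyStored.count(p) > 0) {
            ++result.numDBmatches;  // seen on an earlier run, skip it
        } else {
            result.toInsert.push_back(p);
        }
    }
    return result;
}
```

In a real run, the membership test would be replaced by a query on whichever column identifies an image uniquely.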
From there, all that’s left is to enter the main loop and search for image files! The loop itself populates a temporary imgAttributes struct, flushes that information into a table and zeros the struct out for re-use on the following image. The only exciting thing left is reading the image file for insertion into the database. I chose to do this since the images I’m working on are fairly small, and any manipulations I may make on them do not affect the original database! The code for that resides in the else condition on lines 147-279.
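For the file-reading step, a minimal sketch of pulling a whole image into memory might look like this. It uses standard C++ streams, and readFileBytes is a hypothetical helper rather than the code in the program itself.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Read an entire file into memory as raw bytes.
// Binary mode matters on Windows: without it, newline translation
// would corrupt the image data.
std::vector<char> readFileBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

The bytes would still need to be escaped, or bound through a prepared statement, before being placed in an INSERT, since image data can contain zero bytes and quote characters.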
Finally, some of the labels I entered into my database could not be read straightforwardly from the folder and image names, so I included a couple of extra functions towards the bottom to take care of those tasks. Their names all start with setCol# - feel free to take them out as best suits you!
I hope this helps someone else out there!