File_Extractor 1.0.0
--------------------
Author  : Shay Green <gblargg@gmail.com>
Website : http://code.google.com/p/file-extractor/
License : GNU LGPL 2.1 or later for all except unrar
Language: C interface, C++ implementation


Contents
--------
* Overview
* Limitations
* Extracting file data
* Archive file type handling
* Using in multiple threads
* Unicode file paths on Windows
* Error handling
* Solving problems
* Thanks


Overview
--------
File_Extractor (fex) allows you to write one version of file-opening
code that handles normal files and archives of files. It presents each
as a series of files that you can scan and optionally extract; a single
file is made to act like an archive of just one file, so your code
doesn't need to do anything special to handle it.

Basic steps for scanning and extracting from an archive:

* Open an archive or normal file using fex_open().
* Scanning/extraction loop:
	- Exit loop if fex_done() returns true.
	- Get current file's name with fex_name().
	- If more file information is needed, call fex_stat() first.
	- If extracting, use fex_data() or fex_read().
	- Go to next file in archive with fex_next().
* Close archive and free memory with fex_close().

You can stop scanning an archive at any point, for example once you've
found the file you're looking for. If you need to go back to the first
file, call fex_rewind() at any time.

Most functions return an error code; be sure to check it.


Limitations
-----------
All archives:
* A file's checksum is verified only after ALL its data is extracted.
* Encryption, segmentation, files larger than 2GB, and other extra
features are not supported.

GZ archives:
* Only gzip archives of a single file are supported. If it has multiple
files, the reported size will be wrong. Multi-file gzip archives are
very rare.

ZIP archives:
* Supports files compressed using either deflation or store
(uncompressed). Other compression schemes like BZip2 and Deflate64 are
not supported.
* Archive must have a valid directory structure at the end.

RAR archives:
* Support for really old 1.x archives might not work. If you have some
of these old archives, send them to me so I can test them.

7-zip:
* Solid archives can currently use lots of memory when open.


Extracting file data
--------------------
A file's data can be extracted with one or more calls to fex_read(), as
you would read from a normal file. Use fex_tell() to find out how much
has already been read. Use this if you need the data read into your own
structure in memory.

File data can also be extracted to memory by the library with
fex_data(). The pointer returned is valid only until you go to another
file or close the archive, so this is only useful if you need to examine
or process the data immediately and not keep it around for later.
Archive extractors naturally keep a copy of the extracted data in memory
already for solid archive types (currently 7-zip and RAR), so this
function is optimized to avoid making a second copy of it in those
cases.

Use fex_size() to find the size of the extracted data. Remember that
fex_stat() or fex_data() must be called BEFORE calling fex_size().


Archive file type handling
--------------------------
By default, fex uses the filename extension and header to determine
archive type. If the filename extension is unrecognized or it lacks an
extension, fex examines the first few bytes of the file. If still
unrecognized, fex opens it as binary. Fex also checks for common archive
types that it doesn't support, so that it can reject them as unsupported
rather than unhelpfully opening them as binary.

Your file format might itself be an archive, for example your files end
in ".rsn" yet are normal RAR archives, or they end in ".vgz" and are
gzipped. This is why fex checks the headers of files with unknown
filename extensions, rather than treating them as binary or rejecting
them.

Type identification can be customized by using the various
identification functions and fex_open_type().
For example, you could avoid the header check:

	fex_t* fex;
	fex_type_t type = fex_identify_extension( path );
	if ( type == NULL )
		error( "Unsupported archive type" );
	
	error( fex_open_type( &fex, path, type ) );

Note that you'll only get a NULL type for a known archive type that fex
doesn't handle; you won't get it for your own files, for example
fex_identify_extension("myfile.foo") won't return NULL (unless for some
reason you've disabled binary file support).

Use fex_type_list() to get a list of the types fex supports, for example
to tell the user what archive types your program supports:

	const fex_type_t* t;
	for ( t = fex_type_list(); *t; t++ )
		printf( "%s\n", fex_type_name( *t ) );

To get the fex_type_t for a particular archive type, use
fex_identify_extension():

	fex_type_t zip_type = fex_identify_extension( ".zip" );
	if ( zip_type == NULL )
		error( "ZIP isn't supported" );

Be sure to check the result as shown, rather than assuming the library
supports a particular archive type. Use an extension of "" to get the
type for binary files:

	fex_type_t bin_type = fex_identify_extension( "" );
	if ( bin_type == NULL )
		error( "Binary files aren't supported?!?" );


Using in multiple threads
-------------------------
Fex supports multi-threaded programs. If only one thread at a time is
using the library, nothing special needs to be done. If more than one
thread is using the library, the following must be done:

* Call fex_init() from the main thread and ensure it completes before
any other threads use any fex functions. This initializes shared data
tables used by the extractors.

* For each archive opened, only access it from one thread at a time.
Different archives can be accessed from different threads without any
synchronization, since fex uses no global variables. If the same archive
must be accessed from multiple threads, all calls to any fex functions
must be in critical section(s).

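The critical-section rule for a shared archive can be sketched as
follows. This is only an illustration under stated assumptions: it uses
POSIX threads, and shared_archive_t, shared_archive_claim() and
run_workers() are hypothetical names that are NOT part of fex.h. The
position counter merely stands in for an open fex_t*, so that the sketch
compiles without the library; in a real program the marked lines would
be calls such as fex_done() and fex_next(), and fex_init() would be
called once from the main thread before any worker is started.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for one open archive shared by several threads.
   Not part of fex.h; real code would keep a fex_t* here. */
typedef struct {
	pthread_mutex_t lock; /* guards every access to the archive */
	int position;         /* index of the current file */
	int file_count;       /* number of files in the archive */
} shared_archive_t;

/* Claims the next file index inside the critical section.
   Returns -1 once every file has been claimed. */
int shared_archive_claim( shared_archive_t* a )
{
	int index = -1;
	pthread_mutex_lock( &a->lock );
	if ( a->position < a->file_count ) /* stands in for !fex_done() */
		index = a->position++;         /* stands in for fex_next() */
	pthread_mutex_unlock( &a->lock );
	return index;
}

/* Per-thread state: which archive to scan and how many files it got */
typedef struct {
	shared_archive_t* archive;
	int claimed;
} worker_t;

static void* worker_main( void* arg )
{
	worker_t* w = (worker_t*) arg;
	while ( shared_archive_claim( w->archive ) >= 0 )
		w->claimed++;
	return NULL;
}

/* Scans one shared archive from two threads. Returns the total number
   of files claimed by both, or -1 if setup failed. */
int run_workers( int file_count )
{
	shared_archive_t archive;
	worker_t workers [2];
	pthread_t threads [2];
	int i, started = 0, failed = 0, total = 0;
	
	archive.position = 0;
	archive.file_count = file_count;
	if ( pthread_mutex_init( &archive.lock, NULL ) != 0 )
		return -1;
	
	for ( i = 0; i < 2; i++ )
	{
		workers [i].archive = &archive;
		workers [i].claimed = 0;
		if ( pthread_create( &threads [i], NULL, worker_main, &workers [i] ) != 0 )
		{
			failed = 1;
			break;
		}
		started++;
	}
	
	/* Always join whatever was started before the archive goes away */
	for ( i = 0; i < started; i++ )
	{
		pthread_join( threads [i], NULL );
		total += workers [i].claimed;
	}
	
	pthread_mutex_destroy( &archive.lock );
	return failed ? -1 : total;
}
```

Because every access goes through the same lock, each file is claimed
exactly once no matter how the two threads interleave. When each thread
opens its own archive instead, no lock is needed at all.
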

Unicode file paths on Windows
-----------------------------
If using Windows and your program supports Unicode file paths, enable
BLARGG_UTF8_PATHS in blargg_config.h, and convert your wide-character
paths to UTF-8 before passing them to fex.h functions:

	/* Wide-character path that could have come from system */
	wchar_t wide_path [] = L"demo.zip";
	
	/* Convert from wide path and check for error */
	char* path = fex_wide_to_path( wide_path );
	if ( path == NULL )
		error( "Out of memory" );
	
	/* Use converted path for fex call */
	error( fex_open( &fex, path ) );
	
	/* Free memory used by path */
	fex_free_path( path );

The converted path can be used with any of the fex functions that take
paths, for example fex_identify_extension() or fex_has_extension().


Error handling
--------------
Most functions that can fail return fex_err_t, a pointer type. On
failure they return a pointer to an error object, and on success they
return NULL. Use fex_err_code() to get a conventional error code, or
fex_err_str() to get a string suitable for reporting to the user.

There are two basic approaches that your code can use to handle library
errors. It can return errors, or report them and exit the function via
some other means.

Your code can return errors as the library does, using fex_err_t:

	#define RETURN_ERR( expr ) \
		do {\
			fex_err_t err = (expr);\
			if ( err != NULL )\
				return err;\
		} while ( 0 )
	
	fex_err_t my_func()
	{
		RETURN_ERR( fex_foo() );
		RETURN_ERR( fex_bar() );
		return NULL;
	}

If you have your own error codes, you can convert fex's errors to them:

	// error codes that differ from library's
	enum {
		my_ok = 0,
		my_generic_error = 123,
		my_out_of_memory = 456,
		my_file_not_found = 789
		// ...
	};
	
	int convert_error( fex_err_t err )
	{
		switch ( fex_err_code( err ) )
		{
		case fex_ok:               return my_ok;
		case fex_err_generic:      return my_generic_error;
		case fex_err_memory:       return my_out_of_memory;
		case fex_err_file_missing: return my_file_not_found;
		// ...
		default:                   return my_generic_error;
		}
	}
	
	#define RETURN_ERR( expr ) \
		do {\
			fex_err_t err = (expr);\
			if ( err != NULL )\
				return convert_error( err );\
		} while ( 0 )
	
	int my_func()
	{
		RETURN_ERR( fex_foo() );
		RETURN_ERR( fex_bar() );
		return my_ok;
	}

The other approach is to pass all errors to an error handler function
that never returns if passed a non-success error value:

	// never returns if err != NULL
	void handle_error( fex_err_t err );
	
	void my_func()
	{
		handle_error( fex_foo() );
		handle_error( fex_bar() );
	}

handle_error() could print the error and exit the program:

	void handle_error( fex_err_t err )
	{
		if ( err != NULL )
		{
			const char* str = fex_err_str( err );
			printf( "Error: %s\n", str );
			exit( EXIT_FAILURE );
		}
	}

handle_error() could also throw a C++ exception (or equivalently in C,
longjmp() back to a setjmp() done inside caller()):

	void handle_error( fex_err_t err )
	{
		switch ( fex_err_code( err ) )
		{
		case fex_ok:
			return;
		
		case fex_err_memory:
			throw std::bad_alloc();
		
		// ...
		
		case fex_err_generic:
		default:
			throw std::runtime_error( fex_err_str( err ) );
		}
	}
	
	void caller()
	{
		try
		{
			my_func();
		}
		catch ( const std::exception& e )
		{
			printf( "Error: %s\n", e.what() );
		}
	}


Solving problems
----------------
If you're having problems, try the following:

* Enable debugging support in your environment. This enables assertions
and other run-time checks. In particular, be sure NDEBUG isn't defined.

* Turn the compiler's optimizer off. Sometimes an optimizer generates
bad code.

* If multiple threads are being used, ensure that only one at a time is
accessing a given set of objects from the library. This library is not
in general thread-safe, though independent objects can be used in
separate threads.

* If all else fails, see if the demo works.


Thanks
------
Thanks to Richard Bannister, Kode54, byuu, Cless, and DJRobX for testing
and giving feedback for the library. Thanks to the authors of zlib,
unrar, and 7-zip.

-- 
Shay Green <gblargg@gmail.com>