8.3 fcntl: Locks and Other File Operations
The fcntl system call is the access point for several advanced operations on file
descriptors. The first argument to fcntl is an open file descriptor, and the second is a
value that indicates which operation is to be performed. For some operations, fcntl
takes an additional argument. We’ll describe here one of the most useful fcntl
operations, file locking. See the fcntl man page for information about the others.
The fcntl system call allows a program to place a read lock or a write lock on a
file, somewhat analogous to the mutex locks discussed in Chapter 5, “Interprocess
Communication.” A read lock is placed on a readable file descriptor, and a write lock
is placed on a writable file descriptor. More than one process may hold a read lock on
the same file at the same time, but only one process may hold a write lock, and the
same file may not be both locked for read and locked for write. Note that placing a
lock does not actually prevent other processes from opening the file, reading from it,
or writing to it, unless they acquire locks with fcntl as well.
To place a lock on a file, first create and zero out a struct flock variable. Set the
l_type field of the structure to F_RDLCK for a read lock or F_WRLCK for a write lock.
Then call fcntl, passing a file descriptor to the file, the F_SETLKW operation code, and
a pointer to the struct flock variable. If another process holds a lock that prevents a
new lock from being acquired, fcntl blocks until that lock is released.
The program in Listing 8.2 opens a file for writing whose name is provided on the
command line, and then places a write lock on it. The program waits for the user to
hit Enter and then unlocks and closes the file.
Listing 8.2 (lock-file.c) Create a Write Lock with fcntl
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main (int argc, char* argv[])
{
  char* file = argv[1];
  int fd;
  struct flock lock;

  printf ("opening %s\n", file);
  /* Open a file descriptor to the file. */
  fd = open (file, O_WRONLY);
  printf ("locking\n");
  /* Initialize the flock structure. */
  memset (&lock, 0, sizeof(lock));
  lock.l_type = F_WRLCK;
  /* Place a write lock on the file. */
  fcntl (fd, F_SETLKW, &lock);

  printf ("locked; hit Enter to unlock ");
  /* Wait for the user to hit Enter. */
  getchar ();

  printf ("unlocking\n");
  /* Release the lock. */
  lock.l_type = F_UNLCK;
  fcntl (fd, F_SETLKW, &lock);

  close (fd);
  return 0;
}
Compile and run the program on a test file—say, /tmp/test-file—like this:
% cc -o lock-file lock-file.c
% touch /tmp/test-file
% ./lock-file /tmp/test-file
opening /tmp/test-file
locking
locked; hit Enter to unlock
Now, in another window, try running it again on the same file.
% ./lock-file /tmp/test-file
opening /tmp/test-file
locking
Note that the second instance is blocked while attempting to lock the file. Go back to
the first window and press Enter:
unlocking
The program running in the second window immediately acquires the lock.
If you prefer fcntl not to block if the call cannot get the lock you requested,
use F_SETLK instead of F_SETLKW. If the lock cannot be acquired, fcntl returns –1
immediately.
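The nonblocking variant can be sketched as follows. This is only an illustration: the helper name try_write_lock is hypothetical, and the sketch assumes that a conflicting lock is reported through errno as EACCES or EAGAIN, which are the two values POSIX permits for this case.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: try to place a write lock on the whole file
   without blocking.  Returns 1 if the lock was acquired, 0 if another
   process holds a conflicting lock, and -1 on any other error.  */
int try_write_lock (int fd)
{
  struct flock lock;

  memset (&lock, 0, sizeof (lock));
  lock.l_type = F_WRLCK;
  /* A start and length of zero, measured from the beginning of the
     file, select the entire file.  */
  lock.l_whence = SEEK_SET;
  if (fcntl (fd, F_SETLK, &lock) == 0)
    return 1;
  /* A conflicting lock is reported as either EACCES or EAGAIN.  */
  if (errno == EACCES || errno == EAGAIN)
    return 0;
  return -1;
}
```

A caller could retry later, or report that the file is busy, when the helper returns 0.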
Linux provides another implementation of file locking with the flock call. The
fcntl version has a major advantage: It works with files on NFS [3] file systems (as long
as the NFS server is reasonably recent and correctly configured). So, if you have access
to two machines that both mount the same file system via NFS, you can repeat the
previous example using two different machines. Run lock-file on one machine,
specifying a file on an NFS file system, and then run it again on another machine,
specifying the same file. NFS wakes up the second program when the lock is released
by the first program.
3. Network File System (NFS) is a common network file sharing technology, comparable to
Windows’ shares and network drives.
8.4 fsync and fdatasync: Flushing Disk Buffers
On most operating systems, when you write to a file, the data is not immediately
written to disk. Instead, the operating system caches the written data in a memory
buffer, to reduce the number of required disk writes and improve program
responsiveness. When the buffer fills or some other condition occurs (for instance, enough time
elapses), the system writes the cached data to disk all at one time.
Linux provides caching of this type as well. Normally, this is a great boon to
performance. However, this behavior can make programs that depend on the integrity of
disk-based records unreliable. If the system goes down suddenly—for instance, due to a
kernel crash or power outage—any data written by a program that is in the memory
cache but has not yet been written to disk is lost.
For example, suppose that you are writing a transaction-processing program that
keeps a journal file. The journal file contains records of all transactions that have been
processed so that if a system failure occurs, the state of the transaction data can be
reconstructed. It is obviously important to preserve the integrity of the journal file—
whenever a transaction is processed, its journal entry should be sent to the disk drive
immediately.
To help you implement this, Linux provides the fsync system call. It takes one
argument, a writable file descriptor, and flushes to disk any data written to this file.
The fsync call doesn’t return until the data has physically been written.
The function in Listing 8.3 illustrates the use of fsync. It writes a single-line entry
to a journal file.
Listing 8.3 (write_journal_entry.c) Write and Sync a Journal Entry
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

const char* journal_filename = "journal.log";

void write_journal_entry (char* entry)
{
  int fd = open (journal_filename, O_WRONLY | O_CREAT | O_APPEND, 0660);
  write (fd, entry, strlen (entry));
  write (fd, "\n", 1);
  fsync (fd);
  close (fd);
}
Another system call, fdatasync, does the same thing. However, although fsync
guarantees that the file’s modification time will be updated, fdatasync does not; it guarantees
only that the file’s data will be written. This means that in principle, fdatasync can
execute faster than fsync because it needs to force only one disk write instead of two.
However, in current versions of Linux, these two system calls actually do the same
thing, both updating the file’s modification time.
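Because fdatasync takes the same single file descriptor argument as fsync, switching between the two is a one-line change. The sketch below assumes a hypothetical helper named append_and_flush that works on an already-open journal descriptor; unlike Listing 8.3, it also checks the return values.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one entry to an already-open journal
   descriptor and flush only the file's data.  Returns 0 on success
   and -1 if the write or the flush fails.  */
int append_and_flush (int fd, const char* entry)
{
  size_t length = strlen (entry);

  if (write (fd, entry, length) != (ssize_t) length)
    return -1;
  /* fdatasync is called exactly like fsync.  */
  return fdatasync (fd);
}
```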
The fsync system call enables you to force a buffer write explicitly. You can also
open a file for synchronous I/O, which causes all writes to be committed to disk
immediately. To do this, specify the O_SYNC flag when opening the file with the open call.
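As a rough illustration, a journal file like the one in Listing 8.3 could be opened for synchronous I/O as shown here. The helper name open_journal_sync is hypothetical; the flags and mode are the ones used in Listing 8.3, with O_SYNC added.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: open a journal file so that each write is
   committed to disk before the write call returns.  Returns the file
   descriptor, or -1 on failure.  */
int open_journal_sync (const char* filename)
{
  return open (filename, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0660);
}
```

With a descriptor opened this way, no separate fsync call is needed after each write, at the cost of every write waiting for the disk.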
8.5 getrlimit and setrlimit: Resource Limits
The getrlimit and setrlimit system calls allow a process to read and set limits on the
system resources that it can consume. You may be familiar with the ulimit shell
command, which enables you to restrict the resource usage of programs you run; these
system calls allow a program to do this programmatically.
For each resource there are two limits, the hard limit and the soft limit. The soft limit
may never exceed the hard limit, and only processes with superuser privilege may
raise the hard limit. Typically, an application program will reduce the soft limit to
place a throttle on the resources it uses.
Both getrlimit and setrlimit take as arguments a code specifying the resource
limit type and a pointer to a struct rlimit variable. The getrlimit call fills the fields
of this structure, while the setrlimit call changes the limit based on its contents. The
rlimit structure has two fields: rlim_cur is the soft limit, and rlim_max is the hard
limit.
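The read-modify-write pattern this implies can be sketched as follows. The helper name lower_nofile_soft_limit is hypothetical, and the sketch uses the RLIMIT_NOFILE resource code (the limit on open file descriptors) purely as an example.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the soft limit on open file descriptors
   to new_soft, leaving the hard limit alone.  Returns 0 on success
   and -1 on failure.  */
int lower_nofile_soft_limit (rlim_t new_soft)
{
  struct rlimit rl;

  /* Read both limits first so that rlim_max is preserved.  */
  if (getrlimit (RLIMIT_NOFILE, &rl) != 0)
    return -1;
  /* The soft limit may never exceed the hard limit.  */
  if (new_soft > rl.rlim_max)
    new_soft = rl.rlim_max;
  rl.rlim_cur = new_soft;
  return setrlimit (RLIMIT_NOFILE, &rl);
}
```

Reading the current limits before calling setrlimit avoids accidentally changing the hard limit, since setrlimit applies both fields of the structure.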
Some of the most useful resource limits that may be changed are listed here, with
their codes:
• RLIMIT_CPU: The maximum CPU time, in seconds, used by a program. This is
  the amount of time that the program is actually executing on the CPU, which is
  not necessarily the same as wall-clock time. If the program exceeds this time
  limit, it is terminated with a SIGXCPU signal.
• RLIMIT_DATA: The maximum amount of memory that a program can allocate
  for its data. Additional allocation beyond this limit will fail.
• RLIMIT_NPROC: The maximum number of child processes that can be running
  for this user. If the process calls fork and too many processes belonging to this
  user are running on the system, fork fails.
• RLIMIT_NOFILE: The maximum number of file descriptors that the process may
  have open at one time.
See the
175
8.6 getrusage: Process Statistics
Listing 8.4 (limit-cpu.c) CPU Time Limit Demonstration
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

int main ()
{
  struct rlimit rl;

  /* Obtain the current limits. */
  getrlimit (RLIMIT_CPU, &rl);
  /* Set a CPU limit of 1 second. */
  rl.rlim_cur = 1;
  setrlimit (RLIMIT_CPU, &rl);
  /* Do busy work. */
  while (1);

  return 0;
}
When the program is terminated by SIGXCPU, the shell helpfully prints out a message
interpreting the signal:
% ./limit-cpu
CPU time limit exceeded
8.6 getrusage: Process Statistics
The getrusage system call retrieves process statistics from the kernel. It can be used to
obtain statistics either for the current process by passing RUSAGE_SELF as the first
argument, or for all terminated child processes that were forked by this process and its
children by passing RUSAGE_CHILDREN. The second argument to getrusage is a pointer
to a struct rusage variable, which is filled with the statistics.
A few of the more interesting fields in struct rusage are listed here:
• ru_utime: A struct timeval field containing the amount of user time, in
  seconds, that the process has used. User time is CPU time spent executing the user
  program, rather than in kernel system calls.
• ru_stime: A struct timeval field containing the amount of system time, in
  seconds, that the process has used. System time is the CPU time spent executing
  system calls on behalf of the process.
• ru_maxrss: The largest amount of physical memory occupied by the process’s
  data at one time over the course of its execution.
The getrusage man page lists all the available fields. See Section 8.7, “gettimeofday:
Wall-Clock Time,” for information about
struct timeval.
The function in Listing 8.5 prints out the current process’s user and system time.
Listing 8.5 (print-cpu-times.c) Display Process User and System Times
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <unistd.h>

void print_cpu_time()
{
  struct rusage usage;
  getrusage (RUSAGE_SELF, &usage);
  /* Cast the timeval fields to long to match the %ld conversions. */
  printf ("CPU time: %ld.%06ld sec user, %ld.%06ld sec system\n",
          (long) usage.ru_utime.tv_sec, (long) usage.ru_utime.tv_usec,
          (long) usage.ru_stime.tv_sec, (long) usage.ru_stime.tv_usec);
}
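The RUSAGE_CHILDREN case described at the start of this section can be sketched in a similar way. The helper name child_cpu_microseconds and the size of the busy loop are hypothetical choices made for this illustration; the sketch relies on the fact that a child's statistics are counted only after the child has terminated and been waited for.

```c
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a short-lived child, wait for it, and return
   the total CPU time (user plus system), in microseconds, charged to
   terminated children.  Returns -1 on failure.  */
long child_cpu_microseconds (void)
{
  struct rusage usage;
  pid_t child = fork ();

  if (child == -1)
    return -1;
  if (child == 0) {
    /* Child: burn a little CPU time, then exit.  */
    volatile unsigned long i;
    for (i = 0; i < 50000000UL; ++i)
      ;
    _exit (0);
  }
  /* Parent: the child's statistics are included only after it has
     terminated and been waited for.  */
  if (waitpid (child, NULL, 0) == -1)
    return -1;
  if (getrusage (RUSAGE_CHILDREN, &usage) != 0)
    return -1;
  return (long) (usage.ru_utime.tv_sec + usage.ru_stime.tv_sec) * 1000000L
         + (long) (usage.ru_utime.tv_usec + usage.ru_stime.tv_usec);
}
```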
8.7 gettimeofday: Wall-Clock Time
The gettimeofday system call gets the system’s wall-clock time. It takes a pointer to a
struct timeval variable. This structure represents a time, in seconds, split into two
fields. The tv_sec field contains the integral number of seconds, and the tv_usec field
contains an additional number of microseconds. This struct timeval value represents
the number of seconds elapsed since the start of the UNIX epoch, at midnight UTC
on January 1, 1970. The gettimeofday call also takes a second argument, which should
be NULL. Include <sys/time.h> if you use this system call.
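Because the time is split into two fields, computing an interval means combining both. The following sketch shows one way to do so; the helper name elapsed_microseconds is hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: return the number of microseconds between two
   struct timeval values, end minus start.  */
long long elapsed_microseconds (const struct timeval* start,
                                const struct timeval* end)
{
  /* The microsecond difference may be negative; adding it to the
     scaled difference in whole seconds gives the correct total.  */
  return (long long) (end->tv_sec - start->tv_sec) * 1000000LL
         + (end->tv_usec - start->tv_usec);
}
```

To time a piece of code, call gettimeofday (&start, NULL) before it and gettimeofday (&end, NULL) after it, and then pass both values to the helper. Keep in mind that this measures wall-clock time, not CPU time.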
The number of seconds in the UNIX epoch isn’t usually a very handy way of
representing dates. The localtime and strftime library functions help manipulate the
return value of gettimeofday. The localtime function takes a pointer to the number
of seconds (the tv_sec field of struct timeval) and returns a pointer to a struct tm
object. This structure contains more useful fields, which are filled according to the
local time zone:
• tm_hour, tm_min, tm_sec: The time of day, in hours, minutes, and seconds.
• tm_year, tm_mon, tm_mday: The year, month, and date.
• tm_wday: The day of the week. Zero represents Sunday.
• tm_yday: The day of the year.
• tm_isdst: A flag indicating whether daylight saving time is in effect.
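When using these fields directly, note that tm_year holds the number of years since 1900 and tm_mon counts months starting from zero. The sketch below, built around a hypothetical helper named format_date, shows the adjustments needed to display a conventional date.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format the date fields of a struct tm as
   YYYY-MM-DD into buffer.  Returns the number of characters that
   snprintf would have written, not counting the terminating null.  */
int format_date (const struct tm* ptm, char* buffer, size_t size)
{
  /* tm_year counts years since 1900, and tm_mon counts months from
     zero (January is 0), so both need adjusting for display.  */
  return snprintf (buffer, size, "%04d-%02d-%02d",
                   ptm->tm_year + 1900, ptm->tm_mon + 1, ptm->tm_mday);
}
```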
The strftime function additionally can produce from the struct tm pointer a
customized, formatted string displaying the date and time. The format is specified in a
manner similar to printf, as a string with embedded codes indicating which time
fields to include. For example, this format string
"%Y-%m-%d %H:%M:%S"
specifies the date and time in this form:
2001-01-14 13:09:42
Pass strftime a character buffer to receive the string, the length of that buffer, the
format string, and a pointer to a struct tm variable. See the strftime man page for a
complete list of codes that can be used in the format string. Notice that neither
localtime nor strftime handles the fractional part of the current time, that is, anything
more precise than 1 second (the tv_usec field of struct timeval). If you want this in your
formatted time strings, you’ll have to include it yourself.
Include <time.h> if you call localtime or strftime.
The function in Listing 8.6 prints the current date and time of day, down to the
millisecond.
Listing 8.6 (print-time.c) Print Date and Time
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

void print_time ()
{
  struct timeval tv;
  struct tm* ptm;
  char time_string[40];
  long milliseconds;

  /* Obtain the time of day, and convert it to a tm struct. */
  gettimeofday (&tv, NULL);
  ptm = localtime (&tv.tv_sec);
  /* Format the date and time, down to a single second. */
  strftime (time_string, sizeof (time_string), "%Y-%m-%d %H:%M:%S", ptm);
  /* Compute milliseconds from microseconds. */
  milliseconds = tv.tv_usec / 1000;
  /* Print the formatted time, in seconds, followed by a decimal point
     and the milliseconds. */
  printf ("%s.%03ld\n", time_string, milliseconds);
}
8.8 The mlock Family: Locking Physical Memory
The mlock family of system calls allows a program to lock some or all of its address
space into physical memory. This prevents Linux from paging this memory to swap
space, even if the program hasn’t accessed it for a while.
A time-critical program might lock physical memory because the time delay of
paging memory out and back may be too long or too unpredictable. High-security
applications may also want to prevent sensitive data from being written out to a swap
file, where they might be recovered by an intruder after the program terminates.
Locking a region of memory is as simple as calling mlock with a pointer to the start
of the region and the region’s length. Linux divides memory into pages and can lock
only entire pages at a time; each page that contains part of the memory region
specified to mlock is locked. The getpagesize function returns the system’s page size, which
is 4KB on x86 Linux.
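To see how many pages a given region touches, the arithmetic can be sketched as follows. The helper name pages_spanned is hypothetical; in a real program the page size would come from getpagesize.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the pages that contain any part of the
   region starting at address and extending length bytes (length must
   be nonzero), given the page size.  */
size_t pages_spanned (uintptr_t address, size_t length, size_t page_size)
{
  /* Index of the page holding the first byte of the region.  */
  uintptr_t first_page = address / page_size;
  /* Index of the page holding the last byte of the region.  */
  uintptr_t last_page = (address + length - 1) / page_size;

  return (size_t) (last_page - first_page + 1);
}
```

For instance, with 4KB pages, a 2-byte region that starts at the last byte of one page spans two pages, so locking it with mlock locks 8KB of memory.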
For example, to allocate 32MB of address space and lock it into RAM, you would
use this code:
const int alloc_size = 32 * 1024 * 1024;
char* memory = malloc (alloc_size);
mlock (memory, alloc_size);
Note that simply allocating a page of memory and locking it with mlock doesn’t
reserve physical memory for the calling process because the pages may be
copy-on-write. [5] Therefore, you should write a dummy value to each page as well:
size_t i;
size_t page_size = getpagesize ();
for (i = 0; i < alloc_size; i += page_size)
  memory[i] = 0;
The write to each page forces Linux to assign a unique, unshared memory page to the
process for that page.
To unlock a region, call munlock, which takes the same arguments as mlock.
If you want your program’s entire address space locked into physical memory, call
mlockall. This system call takes a single flag argument: MCL_CURRENT locks all currently
allocated memory, but future allocations are not locked; MCL_FUTURE locks all pages that
are allocated after the call. Use MCL_CURRENT|MCL_FUTURE to lock into memory both
current and subsequent allocations.
Locking large amounts of memory, especially using mlockall, can be dangerous to
the entire Linux system. Indiscriminate memory locking is a good method of bringing
your system to a grinding halt because other running processes are forced to compete
for smaller physical memory resources and swap rapidly into and back out of memory
(this is known as thrashing). If you lock too much memory, the system will run out of
memory entirely and Linux will start killing off processes.
For this reason, only processes with superuser privilege may lock memory with
mlock or mlockall. If a nonsuperuser process calls one of these functions, it will fail,
return –1, and set errno to EPERM.
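Since mlock can fail, it is worth checking its return value. The following is a minimal sketch built around a hypothetical helper named lock_region; it singles out the EPERM case described above and falls back to strerror for any other error code that a particular kernel may report.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: lock a region into RAM, reporting the reason on
   failure.  Returns 0 on success and -1 on failure.  */
int lock_region (void* start, size_t length)
{
  if (mlock (start, length) == 0)
    return 0;
  if (errno == EPERM)
    fprintf (stderr, "mlock: insufficient privilege\n");
  else
    fprintf (stderr, "mlock: %s\n", strerror (errno));
  return -1;
}
```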
The munlockall call unlocks all memory locked by the current process, including
memory locked with mlock and mlockall.
5. Copy-on-write means that Linux makes a private copy of a page of memory for a process
only when that process writes a value somewhere into it.
A convenient way to monitor the memory usage of your program is to use the top
command. In the output from top, the SIZE column displays the virtual address space
size of each program (the total size of your program’s code, data, and stack, some of
which may be paged out to swap space). The RSS column (for resident set size) shows
the size of physical memory that each program currently resides in. The sum of all the
RSS values for all running programs cannot exceed your computer’s physical memory
size, and the sum of all address space sizes is limited to 2GB (for 32-bit versions of
Linux).
Include <sys/mman.h> if you use any of the mlock system calls.
8.9 mprotect: Setting Memory Permissions
In Section 5.3, “Mapped Memory,” we showed how to use the mmap system call to
map a file into memory. Recall that the third argument to mmap is a bitwise or of
memory protection flags PROT_READ, PROT_WRITE, and PROT_EXEC for read, write, and
execute permission, respectively, or PROT_NONE for no memory access. If a program
attempts to perform an operation on a memory location that is not allowed by these
permissions, it is terminated with a SIGSEGV (segmentation violation) signal.
After memory has been mapped, these permissions can be modified with the
mprotect system call. The arguments to mprotect are an address of a memory region,
the size of the region, and a set of protection flags. The memory region must consist of
entire pages: The address of the region must be aligned to the system’s page size, and
the length of the region must be a page size multiple. The protection flags for these
pages are replaced with the specified value.
Obtaining Page-Aligned Memory
Note that memory regions returned by malloc are typically not page-aligned, even if the size of the
memory is a multiple of the page size. If you want to protect memory obtained from malloc, you will
have to allocate a larger memory region and find a page-aligned region within it.
Alternatively, you can use the mmap system call to bypass malloc and allocate page-aligned memory
directly from the Linux kernel. See Section 5.3, “Mapped Memory,” for details.
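The first approach, finding a page-aligned region inside a larger malloc block, might look like the sketch below. The helper name malloc_page_aligned and its block_out parameter are hypothetical; the essential step is rounding the address up to the next multiple of the page size.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate at least size bytes with malloc and
   return a page-aligned address inside the block.  The pointer that
   must later be passed to free is stored in *block_out.  Returns NULL
   if the allocation fails.  */
void* malloc_page_aligned (size_t size, void** block_out)
{
  size_t page_size = (size_t) getpagesize ();
  /* Over-allocate by one page less a byte so that an aligned region of
     the requested size always fits.  */
  char* block = malloc (size + page_size - 1);
  uintptr_t address;

  if (block == NULL)
    return NULL;
  *block_out = block;
  /* Round the address up to the next multiple of the page size.  */
  address = ((uintptr_t) block + page_size - 1) / page_size * page_size;
  return (void*) address;
}
```

If the aligned region is going to be passed to mprotect, size should itself be a multiple of the page size so that every protected page lies entirely within the block. Always pass the pointer stored in block_out, not the aligned address, to free.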
For example, suppose that your program allocates a page of memory by mapping
/dev/zero, as described in Section 5.3.5, “Other Uses for mmap.” The memory is
initially both readable and writable.
int fd = open ("/dev/zero", O_RDONLY);
char* memory = mmap (NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
close (fd);
Later, your program could make the memory read-only by calling mprotect:
mprotect (memory, page_size, PROT_READ);
An advanced technique to monitor memory access is to protect the region of memory
using mmap or mprotect and then handle the SIGSEGV signal that Linux sends to the
program when it tries to access that memory. The example in Listing 8.7 illustrates this
technique.
Listing 8.7 (mprotect.c) Detect Memory Access Using mprotect
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int alloc_size;
static char* memory;

void segv_handler (int signal_number)
{
  printf ("memory accessed!\n");
  mprotect (memory, alloc_size, PROT_READ | PROT_WRITE);
}

int main ()
{
  int fd;
  struct sigaction sa;

  /* Install segv_handler as the handler for SIGSEGV. */
  memset (&sa, 0, sizeof (sa));
  sa.sa_handler = &segv_handler;
  sigaction (SIGSEGV, &sa, NULL);

  /* Allocate one page of memory by mapping /dev/zero. Map the memory
     as write-only, initially. */
  alloc_size = getpagesize ();
  fd = open ("/dev/zero", O_RDONLY);
  memory = mmap (NULL, alloc_size, PROT_WRITE, MAP_PRIVATE, fd, 0);
  close (fd);
  /* Write to the page to obtain a private copy. */
  memory[0] = 0;
  /* Make the memory unwritable. */
  mprotect (memory, alloc_size, PROT_NONE);

  /* Write to the allocated memory region. */
  memory[0] = 1;