Why can we not insert into files without the additional writes? (I neither mean append, nor over-write)

This occurs as a programming language independent problem to me.

I have a file with the content

aaabddd

When I want to insert C behind b then my code needs to rewrite ddd to get

aaabCddd

Why can I not just insert C at this position?

I can not do this in Java, Python, … . I can not do this in Linux, Windows, … . Am I right?

I do not understand why C can’t simply be inserted without the additional writes. Would someone please explain why this is so?

Given that most file systems store the contents of files in individual blocks that are not necessarily contiguous on the physical disk, but linked via pointer structures, it seems that such a mode – “inserting” rather than “appending” or “overwriting” – ought to be possible, and could certainly be made more efficient than what we have to do now: read the entire content, edit the stream of bytes and re-write the entire content.

However, for better or worse, the UNIX semantics of file systems were designed along the “rough and simple” paradigm in the 1970s: it allows you to do everything, but not necessarily in the most efficient way possible. Nowadays it is almost unthinkable to introduce a new file opening mode into the Virtual File System layer and have any hope of the major file systems adopting support for it. This is a pet peeve of mine, but unfortunately one unlikely to be resolved any time soon.

Theoretically, you could implement a file that would allow this sort of thing. For maximum flexibility, though, you’d need to store a pointer to the next byte along with every byte in the file. Assuming a 64-bit pointer, that would mean that 8 of every 9 bytes of your file would be composed of internal pointers. So it would take 9000 bytes of space to store 1000 bytes of actual data. Reading the file would also be slow since you’d need to read each byte, read the pointer, follow the pointer to read the next byte, etc. rather than reading large, contiguous blocks of data from disk.

Obviously, this sort of approach isn’t practical. You could, though, split the file up into, say 32 kb blocks. That would make it relatively easy to add 32 kb of data at any 32 kb boundary in the file. It wouldn’t make it any easier to add a single byte as the 5th byte of the file. If you reserve some free space in every block, though, you could allow small additions of data to take place that would only affect the data in that single block. You’d have a penalty in terms of file size, of course, but potentially a reasonable one. Figuring out how much space to reserve and how to split blocks, though, tends to be much easier for a particular application than for a general-purpose system– what works in one context may be very bad in another depending on the file access and modification characteristics.

In fact, many systems that spend a lot of time interacting with files implement something like what I’ve described above when they implement their particular file abstraction. Databases, for example, will generally implement some concept of a “block” as the smallest unit of I/O that they can work with and will generally reserve some amount of space for future growth so that updating a row in a table only affects the one block on which that data is stored rather than rewriting the entire file. Different databases, of course, have different implementations with different trade-offs.

The “problem” boils down to how files are written out to the storage medium in a byte by byte fashion.

In it’s most basic representation, a file is nothing more than a series of bytes written out to the disk (aka storage medium). So your original string looks like:

Address  Value
0x00     `a`
0x01     `a`
0x02     `a`
0x03     `b`
0x04     `d`
0x05     `d`
0x06     `d`

And you want to insert C at position 0x04. That requires shifting bytes 4 – 6 down one byte so you can insert the new value. If you don’t, you’re going to over-write the value that’s currently at 0x04 which is not what you want.

Address  Value
0x00     `a`
0x01     `a`
0x02     `a`
0x03     `b`
0x04     `C`
0x05     `d`
0x06     `d`
0x07     `d`

So the reason why you have to re-write the tail of the file after you insert a new value is because there isn’t any space within the file to accept the inserted value. Otherwise you would over-write what was there.

Addendum 1: If you wanted to replace the value of b with C then you do not need to re-write the tail of the string. Replacing a value with a like sized value doesn’t require a rewrite.

Addendum 2: If you wanted to replace the string ab with C then you would need to re-write the rest of the file as you’ve created a gap in the file.

Addendum 3: Block level constructs were created to make handling large files easier to deal with. Instead of having to find 1M worth of contiguous space for your file, you now only need to find 1M worth of available blocks to write to instead.

In theory, you could construct a filesystem that did byte-by-byte linking similar to what blocks provide. Then you could insert a new byte by updating the to | from pointers at the appropriate point. I would hazard a guess that the performance on that would be pretty poor.

As Grandmaster B suggested, use a picture of stacked dominoes to visually understand how the file is represented.

You can’t insert another domino within the line of dominoes without causing everything to tumble over. You have to create the space for the new domino by moving the others down the line. Moving dominoes down the line is the equivalent of re-writing the tail of the file after the insertion point.

The most efficient way to insert a block of bytes in the middle of a file would be to:

Map the file to memory
Append the bytes at the end of the memory image of the file
Rotate these files into place (with a standard algorithm available in the C++ Standard Library, for instance)
Let the OS take care of writing dirty blocks to disk

First you need to read everything after the insert point, then write it back down by as much space as you’re going to insert. Then you can write your “insert” data into the correct place. Extremely poor performance operation, hence not natively supported.

When you do direct access to a file you are using a low level that can be used to build more sophisticated structures. Consider building a database with your data that allows the types of access you need, including insertion.

It would be less expensive if you only need to iterate through the file not do random accesses to a specified offset. If you need random access by offset in file you will need to update the index for all bytes beyond the insertion point.

In general, you will pay in indexing data structures, memory to store the index, and extra disk accesses to update it.

Insertion into a file is not implemented in most file systems because it is considered an “expensive” (time-eating and space eating) operation with potentially long-term “expensive” repercussions and additional failure modes.

A file system with insertion semantics would probably either use shift&insert (potentially very expensive when you insert at the front of a large file, but no/few long-term side-effects) or some sort of generalized heap allocation with variable length allocation sizes (very ill-behaved performance in some cases [imagine the interactive users’ faces if they try to save a file during a stop-the-world GC!]).

If you want to experiment, you can easily build a file I/O abstraction in Java or Python that implements insertion. If you succeed and it has well-behaved performance characteristics, you have the basis for an excellent research paper. Good luck.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 05:44

Thẻ: file-systems, files, operating-systems, programming-languages

Thiết kế website giá rẻ

Danh mục

Why can we not insert into files without the additional writes? (I neither mean append, nor over-write)