Table of Contents
  1. Jonathan Pryor's web log
    1. Announcing NDesk.Options 0.2.1
      1. Usage
      2. What's New?
    2. Threading: Lock Nesting
    3. HackWeek Summary
    4. Announcing NDesk.Options 0.2.0
      1. Usage
      2. What's New?
    5. Unix Signal Handling In C#
    6. Announcing NDesk.Options 0.1.0
      1. Usage
      2. What's New?
    7. Mono and Mixed Mode Assembly Support
    8. So you want to parse a command line...
    9. In Defense Of git
    10. Comparing Java and C# Generics
      1. What Are Generics
      2. Terminology
      3. Generic Methods
      4. Constraints
        1. Java Type Constraints
        2. C# Constraints
      5. Java Wildcards (Java Method Constraints)
        1. Unbounded Wildcards
        2. Bounded Wildcards
        3. C# Equivalents
      6. Generics Implementation
        1. Java Implementation
        2. .NET Implementation
      7. Runtime Environment
        1. Java Runtime Environment
        2. C# Runtime Environment
      8. Summary
      9. Links
    11. Problems with Traditional Object Oriented Ideas
    12. Re-Introducing monodocer
      1. Monodocer
      2. monodocer -importecmadoc
      3. Optimizing monodocer -importecmadoc
      4. Conclusion
    13. Mono.Fuse 0.4.2
      1. Aside: A Walk through Mono.Posix History
      2. Download
      3. GIT Repository
    14. POSIX Says The Darndest Things
    15. Mono.Fuse 0.4.1
      1. Mac OS X HOWTO
      2. Known Issues
      3. Download
      4. GIT Repository
    16. When Comparisons Fail
    17. openSUSE 10.2 Windows Key Solution
    18. openSUSE 10.2 Windows Key Workaround
    19. openSUSE 10.2 Complaints
      1. Drive Partitioning
      2. Using the Windows key
    20. Care and Feeding of openSUSE 10.2
      1. IP Masquerading/Network Address Translation (NAT)
      2. HTTP Server with mod_userdir
      3. Windows Shares
    21. Novell, Microsoft, & Patents
    22. Mono.Fuse 0.4.0
      1. API Changes from the previous release:
      2. Download
      3. GIT Repository
    23. Naming, Mono.Fuse Documentation
    24. Mono.Fuse 0.3.0
      1. API Changes from the previous release:
      2. Download
      3. GIT Repository
    25. Miguel's ReflectionFS
    26. Mono.Fuse, Take 2.1!
    27. Mono.Fuse, Take 2!
    28. Announcing Mono.Fuse
      1. Why?
      2. What about SULF?
      3. Implementation
        1. mono
        2. mcs
      4. HOWTO
      5. Questions
    29. Performance Comparison: IList<T> Between Arrays and List<T>
    30. Reverse Engineering
    31. Programming Language Comparison
    32. System.Diagnostics Tracing Support
    33. Mono.Unix Reorganization
    34. Major Change to Nullable Types
    35. Frogger under Mono
    36. Mono.Unix Documentation Stubs

Jonathan Pryor's web log

Announcing NDesk.Options 0.2.1

I am pleased to announce the release of NDesk.Options 0.2.1. NDesk.Options is a C# program option parser library, inspired by Perl's Getopt::Long option parser.

To download, visit the NDesk.Options web page:

http://www.ndesk.org/Options

Usage

See http://www.ndesk.org/Options and the OptionSet documentation for examples.

What's New?

There have been several minor changes since the previous 0.2.0 release:

Posted on 20 Oct 2008 | Path: /development/ndesk.options/ | Permalink

Threading: Lock Nesting

a.k.a. Why the Java 1.0 collections were rewritten...

Threading is an overly complicated subject, covered in great detail at other locations and in many books. However, there is one subject that either I haven't seen discussed too often, or somehow have managed to miss while reading the plethora of threading sources, something I'll call lock nesting depth:

lock nesting depth
The number of locks that must be acquired and held simultaneously in order to perform a given operation.

In general, the lock nesting depth should be kept as small as possible; anything else results in extra, possibly unnecessary locks, which only reduce performance for no added benefit.

First, an aside: why does threading code require locks? To maintain data invariants for data shared between threads, preventing the data from being corrupted. Note that this is not necessarily the same as producing "correct" data, as there may be internal locks to prevent internal data corruption but the resulting output may not be "correct" (in as much as it isn't the output that we want).

The prototypical example of "non-corrupting but not correct" output is when multiple threads write to the (shared) terminal:

using System;
using System.Threading;

class Test {
	public static void Main ()
	{
		Thread[] threads = new Thread[]{
			new Thread ( () => { WriteMessage ("Thread 1"); } ),
			new Thread ( () => { WriteMessage ("Thread 2"); } ),
		};
		foreach (var t in threads)
			t.Start ();
		foreach (var t in threads)
			t.Join ();
	}

	static void WriteMessage (string who)
	{
		Console.Write ("Hello from ");
		Console.Write (who);
		Console.Write ("!\n");
	}
}

Output for the above program can vary from the sensible (and desirable):

$ mono ls.exe 
Hello from Thread 2!
Hello from Thread 1!
$ mono ls.exe 
Hello from Thread 1!
Hello from Thread 2!

To the downright "corrupt":

Hello from Hello from Hello from Hello from Thread 2!
Thread 1!

(This can happen when Thread 1 is interrupted by Thread 2 before it can write out its entire message.)

Notice what's going on here: as far as the system is concerned, what we're doing is safe -- no data is corrupted, my terminal/shell/operating system/planet isn't going to go bonkers, everything is well defined. It's just that in this circumstance "well defined" doesn't match what I, as the developer/end user, desired to see: one of the first two sets of output.

The solution, as always, is to either add a lock within WriteMessage to ensure that the output is serialized as desired:

	static object o = new object ();
	static void WriteMessage (string who)
	{
		lock (o) {
			Console.Write ("Hello from ");
			Console.Write (who);
			Console.Write ("!\n");
		}
	}

Or to instead ensure that the message can't be split up, working within the predefined semantics of the terminal:

	static void WriteMessage (string who)
	{
		string s = "Hello from " + who + "!\n";
		Console.Write (s);
	}

(Which can oddly generate duplicate messages on Mono; not sure what's up with that... More here.)

For the WriteMessage that uses locks, the lock nesting depth is 2, and this can't be readily improved (because Console.Write is static, and thus must be thread safe as any thread could execute it at any time).

Returning to this entry's subtitle, why were the Java 1.0 collections rewritten? Because they were all internally thread safe. This had its uses, should you be sharing a Hashtable or Vector between threads, but even then it was of limited usefulness, as it only protected the internal state for a single method call, not any state that may require more than one method call. Consider this illustrative code which counts the number of times a given token is encountered:

Hashtable data = new Hashtable ();
for (String token : tokens) {
    if (data.containsKey (token)) {
        Integer n = (Integer) data.get (token);
        data.put (token, new Integer (n.intValue() + 1));
    }
    else {
        data.put (token, new Integer (1));
    }
}

Yes, Hashtable is thread safe and thus won't have its data corrupted, but it can still corrupt your data should multiple threads execute this code against a shared data instance, as there is a race with the data.containsKey() call, where multiple threads may evaluate the same token "simultaneously" (read: before the following data.put call), and thus each thread would try to call data.put (token, new Integer (1)). The result: a missed token.

The solution is obvious: another lock, controlled by the developer, must be used to ensure valid data:

Object lock = new Object ();
Hashtable data = new Hashtable ();
for (String token : tokens) {
    synchronized (lock) {
        if (data.containsKey (token)) {
            Integer n = (Integer) data.get (token);
            data.put (token, new Integer (n.intValue() + 1));
        }
        else {
            data.put (token, new Integer (1));
        }
    }
}
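To watch the externally locked loop hold up under contention, here is a minimal runnable sketch. The class name TokenCounter and its helper runConcurrently are hypothetical, invented for this illustration, and the code assumes Java 8 or later for the lambda syntax. Each worker thread feeds the same token array into one shared Hashtable, and the developer-controlled lock makes the containsKey/get/put sequence atomic.

```java
import java.util.Hashtable;
import java.util.Map;

// Hypothetical demo class; the names are illustrative, not from any library.
class TokenCounter {
    private final Object lock = new Object();
    private final Map<String, Integer> data = new Hashtable<String, Integer>();

    // Lock nesting depth is 2 here: our own `lock`, plus the monitor that
    // Hashtable takes internally on every containsKey/get/put call.
    void count(String[] tokens) {
        for (String token : tokens) {
            synchronized (lock) {
                if (data.containsKey(token)) {
                    data.put(token, data.get(token) + 1);
                } else {
                    data.put(token, 1);
                }
            }
        }
    }

    int get(String token) {
        synchronized (lock) {
            Integer n = data.get(token);
            return n == null ? 0 : n;
        }
    }

    // Runs `threads` workers over the same tokens against one shared counter.
    static TokenCounter runConcurrently(final String[] tokens, int threads) {
        final TokenCounter counter = new TokenCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> counter.count(tokens));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String[] tokens = { "a", "b", "a", "c", "a" };
        TokenCounter counter = runConcurrently(tokens, 8);
        System.out.println("a=" + counter.get("a")
            + " b=" + counter.get("b") + " c=" + counter.get("c"));
    }
}
```

With eight workers and three occurrences of "a" per pass, every run should report a=24, b=8 and c=8. Remove the synchronized (lock) block and the totals may come up short, even though the Hashtable itself never gets corrupted.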

Consequently, all "non-trivial" code (where "non-trivial" means "requires more than one method to be called on the collection object in an atomic fashion") requires a lock nesting depth of two. Furthermore, the lock nesting depth would always be at least one, and since many functions were never invoked from multiple threads, or the collection instance was local to that particular method, the synchronization within the collection was pure overhead, providing no benefit.

Which is why in Java 1.2, all of the new collection classes such as ArrayList and HashMap are explicitly unsynchronized, as are all of the .NET 1.0 and 2.0 collection types unless you use a synchronized wrapper such as System.Collections.ArrayList.Synchronized (which, again, is frequently of dubious value if you ever need to invoke more than one method against the collection atomically).
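The same caveat applies to the Java wrappers. As a minimal sketch (the class SynchronizedWrapperDemo and its methods are hypothetical names for this illustration, and Java 8+ is assumed), Collections.synchronizedMap protects each individual call, but a check-then-update still spans two calls, so the caller has to hold the wrapper's monitor across both:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only java.util.Collections/HashMap are real APIs.
class SynchronizedWrapperDemo {
    // get() and put() are each synchronized by the wrapper, but the pair is
    // not atomic unless we also lock around both calls. The wrapper uses
    // itself as its lock, so synchronizing on `counts` is sufficient.
    static void increment(Map<String, Integer> counts, String key) {
        synchronized (counts) {
            Integer n = counts.get(key);
            counts.put(key, n == null ? 1 : n + 1);
        }
    }

    // Increments one key from several threads and returns the final total.
    static int countWithWrapper(int threads, final int perThread) {
        final Map<String, Integer> counts =
            Collections.synchronizedMap(new HashMap<String, Integer>());
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    increment(counts, "token");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        Integer total = counts.get("token");
        return total == null ? 0 : total;
    }

    public static void main(String[] args) {
        System.out.println(countWithWrapper(4, 1000));
    }
}
```

Note that once the caller locks around the pair, the wrapper's own per-call locking adds nothing: the lock nesting depth is back to two, which is the overhead the unsynchronized collections were meant to avoid.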

Finally, the Threading Design Guidelines of the .NET Framework Design Guidelines for Class Library Developers (book) suggests that all static members be thread safe, but instance members by default should not be thread safe:

Instance state does not need to be thread safe. By default, class libraries should not be thread safe. Adding locks to create thread-safe code decreases performance, increases lock contention, and creates the possibility for deadlock bugs to occur. In common application models, only one thread at a time executes user code, which minimizes the need for thread safety. For this reason, the .NET Framework class libraries are not thread safe by default.

Obviously, there are exceptions -- for example, if a static method returns a shared instance of some class, then all of those instance members must be thread safe as they can be accessed via the static method (System.Reflection.Assembly must be thread safe, as an instance of Assembly is returned by the static method Assembly.GetExecutingAssembly). By default, though, instance members should not be thread safe.

Posted on 27 May 2008 | Path: /development/ | Permalink

HackWeek Summary

In case you missed it, last week was "Hackweek" at Novell.

My week was less "hacking" and more "spit-and-polish." In particular:

I had wanted to do other things as well, such as migrate the monodoc-related programs to use NDesk.Options instead of Mono.GetOptions for option parsing, but such efforts will have to wait until later...

Posted on 19 Feb 2008 | Path: /development/ | Permalink

Announcing NDesk.Options 0.2.0

I am pleased to announce the release of NDesk.Options 0.2.0. NDesk.Options is a C# program option parser library, inspired by Perl's Getopt::Long option parser.

To download, visit the NDesk.Options web page:

http://www.ndesk.org/Options

Usage

See http://www.ndesk.org/Options and the OptionSet documentation for examples.

What's New?

There have been numerous changes since the previous 0.1.0 release:

Posted on 14 Feb 2008 | Path: /development/ndesk.options/ | Permalink

Unix Signal Handling In C#

In the beginning, Unix introduced signal(2), which permits a process to respond to external "stimuli", such as a keyboard interrupt (SIGINT), floating-point error (SIGFPE), dereferencing the NULL pointer (SIGSEGV), and other asynchronous events. And lo, it was...well, acceptable, really, but there wasn't anything better, so it at least worked. (Microsoft, when faced with the same problem of allowing processes to perform some custom action upon an external stimulus, invented Structured Exception Handling.)

Then, in a wrapping binge, I exposed it for use in C# with Stdlib.signal(), so that C# code could register signal handlers to be invoked when a signal occurred.

The problem? By their very nature, signals are asynchronous, so even in a single-threaded program, you had to be very careful about what you did, as your "normal" thread was certainly in the middle of doing something. For example, calling malloc(3) was almost certainly a bad idea, because if the process was in the middle of a malloc call already, you'd have a reentrant malloc call which could corrupt the heap.

This reentrant property impacts all functions in the process, including system calls. Consequently, a list of functions that were "safe" for invocation from signal handlers was standardized, and is listed in the above signal man page; it includes functions such as read(2) and write(2), but not functions such as pwrite(2).

Consequently, these limitations and a few other factors led to the general recommendation that signal handlers should be as simple as possible, such as writing to a global variable which the main program occasionally polls.
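The "set a flag, poll it later" shape is language independent, so here is a rough Java sketch of it. This is only an analogy: no real Unix signal is involved, the asynchronous event is simulated by a second thread, and PolledFlagDemo, onSignal and runUntilSignalled are hypothetical names invented for the illustration (Java 8+ assumed).

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; the "signal" is simulated by another thread.
class PolledFlagDemo {
    // Plays the role of the global variable that a signal handler would set.
    static final AtomicBoolean signalled = new AtomicBoolean(false);

    // Stand-in for the handler: it does nothing except record the event.
    static void onSignal() {
        signalled.set(true);
    }

    // Main loop: performs small units of work, polling the flag between them.
    // Returns how many units of work ran before the flag was observed.
    static long runUntilSignalled() {
        long iterations = 0;
        while (!signalled.get()) {
            iterations++;      // "normal processing" would go here
            Thread.yield();
        }
        return iterations;
    }

    public static void main(String[] args) {
        Thread notifier = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            onSignal();
        });
        notifier.start();
        long n = runUntilSignalled();
        System.out.println("stopped after " + n + " iterations");
    }
}
```

All of the interesting work stays in the main loop, where it is safe to allocate, lock, and call arbitrary code; the handler side only flips a flag.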

What's this have to do with Stdlib.signal(), and why was it a mistake to expose it? The problem is the P/Invoke mechanism, which allows marshaling C# delegates as a function pointer that can be invoked from native code. When the function pointer is invoked, the C# delegate is eventually executed.

However, before the C# delegate can be executed, a number of steps need to be done first:

  1. The first thing it does is to ensure the application domain for the thread where the signal handler executes actually matches the appdomain the delegate comes from, if it isn't it may need to set it and do several things that we can't guarantee are signal context safe...
  2. If the delegate is of an instance method we also need to retrieve the object reference, which may require taking locks...

In the same email, lupus suggests an alternate signal handling API that would be safe to use from managed code. Later, I provided a possible implementation. It amounts to treating the UnixSignal instance as a glorified global variable, so that it can be polled to see if the signal has been generated:

UnixSignal signal = new UnixSignal (Signum.SIGINT);
while (!signal.IsSet) {
  /* normal processing */
}

There is also an API to permit blocking the current thread until the signal has been emitted (which also accepts a timeout):

UnixSignal signal = new UnixSignal (Signum.SIGINT);
// Wait for SIGINT to be generated within 5 seconds
if (signal.WaitOne (5000, false)) {
    // SIGINT generated
}

Groups of signals may also be waited on:

UnixSignal[] signals = new UnixSignal[]{
    new UnixSignal (Signum.SIGINT),
    new UnixSignal (Signum.SIGTERM),
};

// block until a SIGINT or SIGTERM signal is generated.
int which = UnixSignal.WaitAny (signals, -1);

Console.WriteLine ("Got a {0} signal!", signals [which].Signum);

This isn't as powerful as the current Stdlib.signal() mechanism, but it is safe to use, doesn't lead to potentially ill-defined or unwanted behavior, and is the best that we can readily provide for use by managed code.

Mono.Unix.UnixSignal is now in svn-HEAD and the mono-1-9 branch, and should be part of the next Mono release.

Posted on 08 Feb 2008 | Path: /development/mono/ | Permalink

Announcing NDesk.Options 0.1.0

I am pleased to announce the release of NDesk.Options 0.1.0. NDesk.Options is a C# program option parser library, inspired by Perl's Getopt::Long option parser.

To download, visit the NDesk.Options web page:

http://www.ndesk.org/Options

Usage

See http://www.ndesk.org/Options and the OptionSet documentation for examples.

What's New?

There have been numerous changes since the previous prototype release:

Posted on 27 Jan 2008 | Path: /development/ndesk.options/ | Permalink

Mono and Mixed Mode Assembly Support

An occasional question on #mono@irc.gnome.org and ##csharp@irc.freenode.net is whether Mono will support mixed-mode assemblies, as generated by Microsoft's Managed Extensions for C++ compiler (Visual Studio .NET 2002, 2003), and C++/CLI (Visual Studio 2005, 2008).

The answer is no, and mixed mode assemblies will likely never be supported.

Why?

First, what's a mixed mode assembly? A mixed mode assembly is an assembly that contains both managed (CIL) and unmanaged (machine language) code. Consequently, they are not portable to other CPU instruction sets, just like normal C and C++ programs and libraries.

Next, why use them? The primary purpose for mixed mode assemblies is as "glue", to e.g. use a C++ library class as a base class of a managed class. This allows the managed class to extend unmanaged methods, allowing the managed code to be polymorphic with respect to existing unmanaged functions. This is extremely useful in many contexts. However, as something like this involves extending a C++ class, it requires that the compiler know all about the C++ compiler ABI (name mangling, virtual function table generation and placement, exception behavior), and thus effectively requires native code. If the base class is within a separate .dll, this will also require that the mixed mode assembly list the native .dll as a dependency, so that the native library is also loaded when the assembly is loaded.

The other thing that mixed mode assemblies support is the ability to export new C functions so that other programs can LoadLibrary() the assembly and GetProcAddress the exported C function.

Both of these capabilities require that the shared library loader for the platform support Portable Executable (PE) files, as assemblies are PE files. If the shared library loader supports PE files, then the loader can ensure that when the assembly is loaded, all listed dependent libraries are also loaded (case 1), or that native apps will be able to load the assembly as if it were a native DLL and resolve DLL entry points against it (case 2).

This requirement is met on Windows, which uses the PE file format for EXE and DLL files. This requirement is not met on Linux, which uses ELF, nor is it currently met on Mac OS X, which uses Mach-O.

So why can't mixed mode assemblies be easily supported in Mono? Because ld.so doesn't like PE.

The only workarounds for this would be to either extend assemblies so that ELF files can contain both managed and unmanaged code, or to extend the shared library loader to support the loading of PE files. Using ELF as an assembly format may be useful, but would restrict portability of such ELF-assemblies to only Mono/Linux; .NET could never make use of them, nor could Mono on Mac OS X. Similarly, extending the shared library loader to support PE could be done, but can it support loading both PE and ELF (or Mach-O) binaries into a single process? What happens if a PE file loaded into an "ELF" process requires KERNEL32.DLL? Extending the shared library loader isn't a panacea either.

This limitation makes mixed mode assemblies of dubious value. It is likely solvable, but there are far more important things for Mono to focus on.

Posted on 27 Jan 2008 | Path: /development/mono/ | Permalink

So you want to parse a command line...

If you develop command-line apps, parsing the command-line is a necessary evil (unless you write software so simple that it doesn't require any options to control its behavior). Consequently, I've written and used several parsing libraries, including Mono.GetOptions, Perl's Getopt::Long library, and some custom written libraries or helpers.

So what's wrong with them? The problem with Mono.GetOptions is that it has high code overhead: in order to parse a command line, you need to create a new type (which inherits from Mono.GetOptions.Options), annotate each field or property within the type with an Option attribute, and let Mono.GetOptions map each command-line argument to a field/property within the Options subclass. See monodocer for an example; search for Opts to find the subclass.

The type-reflector parser is similarly code heavy, if only in a different way. The Mono.Fuse, lb, and omgwtf parsers are one-offs, either specific to a particular environment (e.g. integration with the FUSE native library) or not written with any eye toward reuse.

Which leaves Perl's Getopt::Long library, which I've used for a number of projects, and quite like. It's short, concise, requires no object overhead, and allows seeing at a glance all of the options supported by a program:

use Getopt::Long;
my $data    = "file.dat";
my $help    = undef;
my $verbose = 0;

GetOptions (
	"file=s"    => \$data,
	"v|verbose" => sub { ++$verbose; },
	"h|?|help"  => \$help
);

The above may be somewhat cryptic at first, but it's short, concise, and lets you know at a glance that it takes three sets of arguments, one of which takes a required string parameter (the file option).

So, says I, what would it take to provide similar support in C#? With C# 3.0 collection initializers and lambda delegates, I can get something that feels rather similar to the above Getopt::Long code:

string data = null;
bool help   = false;
int verbose = 0;

var p = new Options () {
	{ "file=",      (v) => data = v },
	{ "v|verbose",  (v) => { ++verbose; } },
	{ "h|?|help",   (v) => help = v != null },
};
p.Parse (argv).ToArray ();

Options.cs has the goods, plus unit tests and additional examples (via the tests).

Options is both more and less flexible than Getopt::Long. It doesn't support providing references to variables, instead using a delegate to do all variable assignment. In this sense, Options is akin to Getopt::Long while requiring that all options use a sub callback (as the v|verbose option does above).
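To make the "one callback per option" idea concrete, here is a deliberately tiny sketch of the technique in Java. It is not the Options.cs implementation: MiniOptions, add and parse are hypothetical names, and the sketch assumes only a small subset of the behavior (aliases separated by |, a trailing = meaning the next argument is the value, and no support for the --name=value form). Java 8+ is assumed for lambdas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical miniature parser; NOT the NDesk.Options implementation.
class MiniOptions {
    private final Map<String, Consumer<String>> actions = new HashMap<>();
    private final Map<String, Boolean> needsValue = new HashMap<>();

    // prototype: aliases separated by '|', with a trailing '=' if the
    // option requires a value, e.g. "file=" or "v|verbose".
    MiniOptions add(String prototype, Consumer<String> action) {
        boolean valued = prototype.endsWith("=");
        String names = valued
            ? prototype.substring(0, prototype.length() - 1)
            : prototype;
        for (String name : names.split("\\|")) {
            actions.put(name, action);
            needsValue.put(name, valued);
        }
        return this;
    }

    // Invokes the callback for each recognized option and returns
    // every argument that no option claimed.
    List<String> parse(String... args) {
        List<String> unhandled = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String name = stripDashes(args[i]);
            Consumer<String> action = (name == null) ? null : actions.get(name);
            if (action == null) {
                unhandled.add(args[i]);
            } else if (needsValue.get(name)) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("missing value for " + args[i]);
                }
                action.accept(args[++i]);   // the next argument is the value
            } else {
                action.accept(args[i]);     // non-null marks the flag as present
            }
        }
        return unhandled;
    }

    // "-v" -> "v", "--verbose" -> "verbose"; anything else is not an option.
    private static String stripDashes(String arg) {
        if (arg.startsWith("--")) {
            return arg.length() > 2 ? arg.substring(2) : null;
        }
        if (arg.startsWith("-") && arg.length() > 1) {
            return arg.substring(1);
        }
        return null;
    }

    public static void main(String[] args) {
        // Java lambdas cannot assign to locals, so one-element arrays hold state.
        String[] data = { null };
        int[] verbose = { 0 };
        boolean[] help = { false };
        MiniOptions p = new MiniOptions()
            .add("file=", v -> data[0] = v)
            .add("v|verbose", v -> verbose[0]++)
            .add("h|?|help", v -> help[0] = v != null);
        List<String> rest = p.parse("--file", "in.dat", "-v", "-v", "extra");
        System.out.println(data[0] + " " + verbose[0] + " " + help[0] + " " + rest);
    }
}
```

The design point carries over directly: because every option is just a name mapped to a callback, the parser needs no knowledge of the variables being set, and the whole option table is visible at a glance.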

Options is more flexible in that it isn't restricted to just strings, integers, and floating point numbers. If there is a TypeConverter registered for your type (to perform string->object conversions), then any type can be used as an option value. To do so, merely declare that type within the callback:

int count = 0;

var p = new Options () {
	{ "c|count=", (int v) => count = v },
};

As additional crack, you can provide an (optional) description of the option so that Options can generate help text for you:

var p = new Options () {
	{ "really-long-option", "description", (v) => {} },
	{ "h|?|help", "print out this message and exit", (v) => {} },
};
p.WriteOptionDescriptions (Console.Out);

would generate the text:

      --really-long-option   description
  -h, -?, --help             print out this message and exit

Options currently supports:

All un-handled parameters are returned from the Options.Parse method, which is implemented as an iterator (hence the calls to .ToArray() in the above C# examples, to force processing).

Posted on 07 Jan 2008 | Path: /development/mono/ | Permalink

In Defense Of git

On Friday at the OpenOffice.org Conference, we had two sessions discussing the future of Source Code Managers in OpenOffice.org: Child workspaces and the OOo SCM system by Jens-Heiner Rechtien and git: the Source Code Manager for OOo? by Jan Holesovsky (kendy).

In the Q&A section after the git presentation, there was a lot of heated debate in which it seemed that Jan and Jens were talking "past" each other. As a git backer, I thought I'd try to bring some clarity to things.

It seemed that Jens has one fundamental problem with git, which itself is fundamental to its operation: commits are not transferred to the remote module; instead, you need an explicit git-push command to send all local changes to the remote repository. Jens claimed three implications of this (that I remember):

  1. git did not permit line-by-line authorship information, as with cvs annotate or svn blame.
  2. Developers would not see changes made by other developers as soon as they happen.
  3. QA and Release Engineering wouldn't be alerted as soon as developers made any change on any child workspace.

The line-by-line authorship information is possible in git with the git blame or git annotate commands (they are synonyms for each other). I suspect I misinterpreted this part of the debate, as all parties should have known that git supported this.

Which leaves the other two issues, which (again) are fundamental to git: a commit does not send any data to the repository. Thus we get to the title of this blog entry: this is a Good Thing™.

Local commits are world changing in a very small way: they're insanely fast, much faster than Subversion. (For example, committing a one-line change to a text file under a Subversion remote directory took me 4.775s; a similar change under git is 0.246s -- 19x faster -- and this is a small Subversion module, ~1.5MB, hosted on the ximian.com Subversion repo, which never seems as loaded as the openoffice.org servers.)

What can you do when your commits are at least 19x faster? You commit more often. You commit when you save your file (or soon thereafter). You commit when your code is 99.995% guaranteed to be WRONG.

Why do this? Because human memory is limited. Most studies show that the average person can remember 7±2 items at a time before they start forgetting things. This matters because a single bug may require changes to multiple different files, and even within a single file your memory will be filled with such issues as what's the scope of this variable?, what's the type of this variable?, what's this method do?, what bug am I trying to fix again?, etc. Human short-term memory is very limited.

So what's the poor developer to do? Most bugs can be partitioned in some way, e.g. into multiple methods or blocks of code, and each such block/sub-problem is solved sequentially -- you pick one sub-problem, solve it, test it (individually if possible), and continue to the next sub-problem. During this process and when you're finished you'll review the patch (is it formatted nicely?, could this code be cleaned up to be more maintainable?), then finally commit your single patch to the repository. It has to be done this way because if you commit at any earlier point in time, someone else will get your intermediate (untested) changes, and you'll break THEIR code flow. This is obviously bad.

During this solve+test cycle, I frequently find that I'll make a set of changes to a file, save it, make other changes, undo them, etc. I never close my file, because (and here's the key point) cvs diff shows me too many changes. It'll show me the changes I made yesterday as well as the changes I made 5 minutes ago, and I need to keep those changes separate -- the ones from yesterday (probably) work, the ones from 5 minutes ago (probably) don't, and the only way I can possibly remember which is the set from 5 minutes ago is to hit Undo in my editor and find out. :-)

So git's local commits are truly world-changing for me: I can commit something as soon as I have it working for a (small) test case, at which point I can move on to related code and fix that sub-problem, even (especially) if it's a change in the same file. I need an easy way to keep track of which are the solved problems (the stuff I fixed yesterday) and the current problem. I need this primarily because the current problem filled my 7±2 memory slots, and I'm unable to easily remember what I did yesterday. (I'm only human! And "easily remember" means "takes less than 0.1s to recall." If you need to think you've already lost.)

This is why I think the other two issues -- developers don't see other changes instantly, and neither does QA -- are a non-issue. It's a feature.

So let's bring in a well-used analogy to programming: writing a book. You write a paragraph, spell check it, save your document, go onto another paragraph/chapter, repeat for a bit, then review what was written. At any part of this process, you'll be ready to Undo your changes because you changed your mind. Changes may need to occur across the entire manuscript.

Remote commits are equivalent to sending each saved manuscript to the author's editor. If someone is going to review/use/depend upon your change, you're going to Damn Well make sure that it Works/is correct before you send that change.

Which brings us to the workflow dichotomy between centralized source code managers (cvs, svn) and distributed managers (git et al.). Centralized source managers by design require more developer effort, because the developer needs to manually track all of the individual changes of a larger work/patch before sending it upstream (as described above).

Decentralized source managers instead help the developer with the tedious effort of tracking individual changes, because the developer can commit without those changes being seen/used by anyone else. The commit instead gets sent when the developer is done with the feature.

This is why I prefer git to Subversion. git allows me to easily work with my 7±2 short-term memory limitations, by allowing me to commit "probably working but not fully tested" code so that I don't need to review those changes at the next cvs diff for the current problem I'm working on.

Posted on 23 Sep 2007 | Path: /development/openoffice.org/ | Permalink

Comparing Java and C# Generics

Or, What's Wrong With Java Generics?

What Are Generics

Java 5.0 and C# 2.0 have both added Generics, which permit a multitude of things:

  1. Improved compiler-assisted checking of types.
  2. Removal of casts from source code (due to (1)).
  3. In C#, performance advantages (discussed later).

This allows you to replace the error-prone Java code:

List list = new ArrayList ();
list.add ("foo");
list.add (new Integer (42));  // added by "mistake"

for (Iterator i = list.iterator (); i.hasNext (); ) {
    String s = (String) i.next ();
        // ClassCastException for Integer -> String
    // work on `s'
    System.out.println (s);
}

with the compiler-checked code:

// constructed generic type
List<String> list = new ArrayList<String> ();
list.add ("foo");
list.add (42); // error: cannot find symbol: method add(int)
for (String s : list)
    System.out.println (s);

The C# equivalent code is nigh identical:

IList<string> list = new List<string> ();
list.Add ("foo");
list.Add (42); // error CS1503: Cannot convert from `int' to `string'
foreach (string s in list)
    Console.WriteLine (s);

Terminology

A Generic Type is a type (classes and interfaces in Java and C#, as well as delegates and structs in C#) that accepts Generic Type Parameters. A Constructed Generic Type is a Generic Type with Generic Type Arguments, which are Types to actually use in place of the Generic Type Parameters within the context of the Generic Type.

For simple generic types, Java and C# have identical syntax for declaring and using Generic Types:

class GenericClass<TypeParameter1, TypeParameter2>
{
    public static void Demo ()
    {
        GenericClass<String, Object> c = 
            new GenericClass<String, Object> ();
    }
}

In the above, GenericClass is a Generic Type, TypeParameter1 and TypeParameter2 are Generic Type Parameters for GenericClass, and GenericClass<String, Object> is a Constructed Generic Type with String as a Generic Type Argument for the TypeParameter1 Generic Type Parameter, and Object as the Generic Type Argument for the TypeParameter2 Generic Type Parameter.

It is an error in C# to create a Generic Type without providing any Type Arguments. Java permits creating Generic Types without providing any Type Arguments; these are called raw types:

Map rawMap = new HashMap <String, String> ();

Java also permits you to leave out Generic Type Arguments from the right-hand-side. Both raw types and skipping Generic Type Arguments elicit a compiler warning:

Map<String, String> correct = new HashMap<String, String> ();
    // no warning, lhs matches rhs
Map<String, String> incorrect = new HashMap ();
    // lhs doesn't match rhs; generates the warning:
    //  Note: gen.java uses unchecked or unsafe operations.
    //  Note: Recompile with -Xlint:unchecked for details.

Compiling the above Java code with -Xlint:unchecked produces:

gen.java:9: warning: [unchecked] unchecked conversion
found   : java.util.HashMap
required: java.util.Map<java.lang.String,java.lang.String>
                Map<String, String> incorrect = new HashMap ();

Note that all "suspicious" code produces warnings, not errors, under Java. Only provably wrong code generates compiler errors (such as adding an Integer to a List<String>).
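Why care about a mere warning? Here is a small sketch (UncheckedWarningDemo is a hypothetical class name) showing that code which compiles with only unchecked warnings can still fail at runtime, at a line that contains no visible cast:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class. javac accepts this file, reporting only
// unchecked/raw-type warnings for the lines marked below.
class UncheckedWarningDemo {
    static List<String> polluted() {
        List raw = new ArrayList();       // raw type
        raw.add(Integer.valueOf(42));     // unchecked call: warning, not an error
        List<String> strings = raw;       // unchecked conversion: warning, not an error
        return strings;
    }

    // Returns true if reading the element as a String fails at runtime.
    static boolean failsAtRuntime() {
        List<String> strings = polluted();
        try {
            String s = strings.get(0);    // the compiler inserts a cast to String here
            System.out.println(s.length());
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("failed at runtime: " + failsAtRuntime());
    }
}
```

The list really does hold an Integer, so the hidden cast throws ClassCastException. The warning was the compiler's only chance to tell you.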

(Also note that "suspicious" code includes Java <= 1.4-style use of collections, i.e. all collections code that predates Java 5.0. This means that you get lots of warnings when migrating Java <= 1.4 code to Java 5.0 and specifying -Xlint:unchecked.)

Aside from the new use of `<', `>', and type names within constructed generic type names, the use of generic types is essentially identical to the use of non-generic types, though Java has some extra flexibility when declaring variables.

Java has one added wrinkle as well: static methods of generic classes cannot reference the type parameters of their enclosing generic class. C# does not have this limitation:

class GenericClass<T> {
    public static void UseGenericParameter (T t) {}
        // error: non-static class T cannot be 
        // referenced from a static context
}

class Usage {
    public static void UseStaticMethod () {
        // Valid C#, not valid Java
        GenericClass<int>.UseGenericParameter (42);
    }
}

Generic Methods

Java and C# both support generic methods, in which a (static or instance) method itself accepts generic type parameters, though they differ in where the generic type parameters are declared. Java places the generic type parameters before the method return type:

class NonGenericClass {
    public static <T> T max (T a, T b) {/*...*/}
}

while C# places them after the method name:

class NonGenericClass {
    static T Max<T> (T a, T b) {/*...*/}
}

Generic methods may exist on both generic- and non-generic classes and interfaces.
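As a minimal runnable sketch of the Java declaration and call syntax (the class and method names here are made up for illustration), note that an explicit generic type argument goes before the method name at the call site:

```java
import java.util.Arrays;
import java.util.List;

class GenericMethodDemo {
    // The type parameter list <T> precedes the return type.
    static <T> T firstOr (List<T> items, T fallback) {
        return items.isEmpty () ? fallback : items.get (0);
    }

    public static void main (String[] args) {
        // T is inferred as String from the arguments.
        String inferred = firstOr (Arrays.asList ("a", "b"), "none");
        // T may also be given explicitly, before the method name.
        Integer explicit = GenericMethodDemo.<Integer>firstOr (
            Arrays.<Integer>asList (), 0);
        System.out.println (inferred + " " + explicit);
    }
}
```

In most code the inferred form is all you need; the explicit form mainly helps when the arguments alone don't pin down the type.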

Constraints

What can you do with those Generic Type Parameters within the class or method body? Not much:

class GenericJavaClass<T> {
    T[] arrayMember  = null;
    T   singleMember = null;

    // Instance methods: a Java static method cannot reference T
    public void Demo ()
    {
        T localVariable = 42; // error
        T localVariable2 = null;

        AcceptGenericTypeParameter (localVariable);
    }

    public void AcceptGenericTypeParameter (T t)
    {
        System.out.println (t.toString ()); // ok
        System.out.println (t.intValue ()); 
            // error: cannot find symbol
    }
}

class GenericCSharpClass<T> {
    T[] arrayMember  = null;
    T   singleMember = default(T);

    public static void Demo ()
    {
        T localVariable  = 42; // error
        T localVariable2 = default(T);

        AcceptGenericTypeParameter (localVariable);
    }

    public static void AcceptGenericTypeParameter (T t)
    {
        Console.WriteLine (t.ToString ());     // ok
        Console.WriteLine (t.GetTypeCode ());  // error: T has no GetTypeCode method
    }
}

So how do we call non-Object methods on objects of a generic type parameter?

  1. Cast the variable to a type that has the method you want (and accept the potentially resulting cast-related exceptions).
  2. Place a constraint on the generic type parameter. A constraint is a compile-time assertion that the generic type argument will fulfill certain obligations. Such obligations include the base class of the generic type argument, any implemented interfaces of the generic type argument, and (in C#) whether the generic type argument's type has a default constructor, is a value type, or a reference type.
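The two options above can be sketched side by side in Java. This is only an illustration (the class and method names are invented): the cast compiles for any argument and fails at runtime, while the constraint moves the check to compile time.

```java
class CastVersusConstraint {
    // Option 1: cast. Compiles for any T, but throws
    // ClassCastException when the argument is not a Number.
    static <T> int viaCast (T t) {
        return ((Number) t).intValue ();
    }

    // Option 2: constraint. The compiler rejects non-Number arguments.
    static <T extends Number> int viaConstraint (T t) {
        return t.intValue ();
    }

    public static void main (String[] args) {
        System.out.println (viaCast (42));
        System.out.println (viaConstraint (3.9));
        try {
            viaCast ("not a number");
        } catch (ClassCastException e) {
            System.out.println ("cast failed at runtime");
        }
        // viaConstraint ("not a number");  // would not compile
    }
}
```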

Java Type Constraints

Java type and method constraints are specified using a "mini expression language" within the `<' and `>' declaring the generic type parameters. For each type parameter that has constraints, the syntax is:

TypeParameter ListOfConstraints

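Where ListOfConstraints is the extends keyword followed by a `&'-separated list of types that the generic type argument must derive from or implement. If a class is listed it must come first; a lone type parameter may also serve as the constraint (e.g. T extends U).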

(`&' must be used instead of `,' because `,' separates each generic type parameter.)

The above constraints also apply to methods, and methods can use some additional constraints described below.

class GenericClass<T extends Number & Comparable<T>> {
    void print (T t) {
        System.out.println (t.intValue ()); // OK
    }
}

class Demo {
    static <U, T extends U>
    void copy (List<T> source, List<U> dest) {
        for (T t : source)
            dest.add (t);
    }

    static void main (String[] args) {
        new GenericClass<Integer>().print (42);
            // OK: Integer extends Number
        new GenericClass<Double>().print (3.14159);
            // OK: Double extends Number
        new GenericClass<String>().print ("string");
            // error: <T>print(T) in gen cannot be applied 
            //  to (java.lang.String)

        ArrayList<Integer> ints = new ArrayList<Integer> ();
        Collections.addAll (ints, 1, 2, 3);
        copy (ints, new ArrayList<Object> ());
            // OK; Integer inherits from Object
        copy (ints, new ArrayList<String> ());
            // error: <U,T>copy(java.util.List<T>,
            //  java.util.List<U>) in cv cannot be 
            //  applied to (java.util.ArrayList<java.lang.Integer>,
            //  java.util.ArrayList<java.lang.String>)
    }
}

C# Constraints

C# generic type parameter constraints are specified with the context-sensitive where keyword, which is placed after the class name or after the method's closing `)'. For each type parameter that has constraints, the syntax is:

where TypeParameter : ListOfConstraints

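Where ListOfConstraints is a comma-separated list of the following constraints: a base class, any number of interfaces, another type parameter, class (the type argument must be a reference type), struct (the type argument must be a value type), and new() (the type argument must have a public parameterless constructor, and this constraint must be listed last).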

class GenericClass<T> : IComparable<GenericClass<T>>
    where T : IComparable<T>
{
    // Private on purpose: CreateInstance<GenericClass<int>> below
    // fails the new() constraint because of it.
    private GenericClass () {}

    // Static, so Demo can call it without needing a constructor.
    public static void Print (T t)
    {
        Console.WriteLine (t.CompareTo (t));
            // OK; T must implement IComparable<T>
    }

    public int CompareTo (GenericClass<T> other)
    {
        return 0;
    }
}

class Demo {
    static void OnlyValueTypes<T> (T t) 
        where T : struct
    {
    }

    static void OnlyReferenceTypes<T> (T t) 
        where T : class
    {
    }

    static void Copy<T, U> (IEnumerable<T> source, 
            ICollection<U> dest)
        where T : U, IComparable<T>, new()
        where U : new()
    {
        foreach (T t in source)
            dest.Add (t);
    }

    static T CreateInstance<T> () where T : new()
    {
        return new T();
    }

    public static void Main (String[] args)
    {
        GenericClass<int>.Print (42);
            // OK: Int32 implements IComparable<int>
        GenericClass<double>.Print (3.14159);
            // OK: Double implements IComparable<double>
        GenericClass<TimeZone>.Print (
            TimeZone.CurrentTimeZone);
            // error: TimeZone doesn't implement 
            //  IComparable<TimeZone>

        OnlyValueTypes (42);    // OK: int is a struct
        OnlyValueTypes ("42");
            // error: string is a reference type

        OnlyReferenceTypes (42);
            // error: int is a struct
        OnlyReferenceTypes ("42");  // OK

        CreateInstance<int> ();
            // OK; int has default constructor
        CreateInstance<GenericClass<int>> ();
            // error CS0310: The type `GenericClass<int>' 
            //  must have a public parameterless constructor
            //  in order to use it as parameter `T' in the 
            //  generic type or method 
            //  `Test.CreateInstance<T>()'

        // In theory, you could do `Copy' instead of 
        // `Copy<...>' below, but it depends on the 
        // type inferencing capabilities of your compiler.
        Copy<int,object> (new int[]{1, 2, 3}, 
            new List<object> ());
            // OK: implicit int -> object conversion exists.
        Copy<int,AppDomain> (new int[]{1, 2, 3}, 
            new List<AppDomain> ());
            // error CS0309: The type `int' must be 
            //  convertible to `System.AppDomain' in order 
            //  to use it as parameter `T' in the generic 
            //  type or method `Test.Copy<T,U>(
            //      System.Collections.Generic.IEnumerable<T>, 
            //      System.Collections.Generic.ICollection<U>)'
    }
}

Java Wildcards (Java Method Constraints)

Java has additional support for covariant and contravariant generic types on method declarations.

By default, you cannot assign an instance of one constructed generic type to a variable of another constructed generic type where the generic type arguments differ:

// Java, though s/ArrayList/List/ for C#
List<String> stringList = new ArrayList<String> ();
List<Object> objectList = stringList; // error

The reason for this is quite obvious with a little thought: if the above were permitted, you could violate the type system:

// Assume above...
stringList.add ("a string");
objectList.add (new Object ());
// and now `stringList' contains a non-String object!

This way lies madness and ClassCastExceptions. :-)

However, sometimes you want the flexibility of having different generic type arguments:

static void cat (Collection<Reader> sources) throws IOException {
    for (Reader r : sources) {
        int c;
        while ((c = r.read()) != -1)
            System.out.print ((char) c);
    }
}

Many types extend Reader, e.g. StringReader and FileReader, so we might want to do this:

Collection<StringReader> sources = 
    new ArrayList<StringReader> ();
Collections.addAll (sources, 
    new StringReader ("foo"), 
    new StringReader ("bar"));
cat (sources);
// error: cat(java.util.Collection<java.io.Reader>) 
//  in gen cannot be applied to 
//  (java.util.Collection<java.io.StringReader>)

There are two ways to make this work:

  1. Use Collection<Reader> instead of Collection<StringReader>:
    Collection<Reader> sources = new ArrayList<Reader> ();
    Collections.addAll (sources, 
        new StringReader ("foo"), 
        new StringReader ("bar"));
    cat (sources);
  2. Use wildcards.

Unbounded Wildcards

If you don't care about the specific generic type arguments involved, you can use `?' as the type argument. This is an unbounded wildcard, because the `?' can represent anything:

static void printAll (Collection<?> c) {
    for (Object o : c)
        System.out.println (o);
}

The primary utility of unbounded wildcards is to migrate pre-Java 5.0 collection uses to Java 5.0 collections (thus removing the probably thousands of warnings -Xlint:unchecked produces) in the easiest manner.

This obviously won't help for cat (above), but it's also possible to "bind" the wildcard, to create a bounded wildcard.

Bounded Wildcards

You create a bounded wildcard by binding an upper- or lower- bound to an unbounded wildcard. Upper bounds are specified via extends, while lower bounds are specified via super. Thus, to allow a Collection parameter that accepts Reader instances or any type that derives from Reader:

static void cat (Collection<? extends Reader> c) 
    throws IOException
{
    /* as before */
}

This permits the more desirable use:

Collection<StringReader> sources = 
    new ArrayList<StringReader> ();
Collections.addAll (sources, 
    new StringReader ("foo"), 
    new StringReader ("bar"));
cat (sources);
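Put together as a self-contained sketch, the bounded-wildcard version of cat looks roughly like this. It departs from the fragments above in two labeled ways: it returns the text instead of printing it (so the result is easy to check), and it uses ArrayList because Collection itself is an interface and cannot be instantiated.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

class WildcardCat {
    // Accepts a collection of Reader or of any Reader subclass.
    static String cat (Collection<? extends Reader> sources)
        throws IOException
    {
        StringBuilder out = new StringBuilder ();
        for (Reader r : sources) {
            int c;
            while ((c = r.read ()) != -1)
                out.append ((char) c);
        }
        return out.toString ();
    }

    public static void main (String[] args) throws IOException {
        Collection<StringReader> sources = new ArrayList<StringReader> ();
        Collections.addAll (sources,
            new StringReader ("foo"),
            new StringReader ("bar"));
        System.out.println (cat (sources));  // prints "foobar"
    }
}
```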

Bounded wildcards also allow you to reduce the number of generic type parameters you might otherwise need on a generic method; compare this Demo.copy to the previous Java Demo.copy implementation:

class Demo {
    static <T> void copy (List<? extends T> source, 
        List<? super T> dest)
    {
        for (T t : source)
            dest.add (t);
    }
}
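A small usage sketch of the wildcard-based copy (the WildcardCopy class name and the sample lists are invented for illustration) shows which destination types the compiler accepts:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WildcardCopy {
    static <T> void copy (List<? extends T> source,
        List<? super T> dest)
    {
        for (T t : source)
            dest.add (t);
    }

    public static void main (String[] args) {
        List<Integer> ints = Arrays.asList (1, 2, 3);
        List<Number> numbers = new ArrayList<Number> ();
        List<Object> objects = new ArrayList<Object> ();

        copy (ints, numbers);  // OK: Number is a supertype of Integer
        copy (ints, objects);  // OK: Object is a supertype of Integer
        System.out.println (numbers + " " + objects);

        // copy (ints, new ArrayList<String> ());  // would not compile
    }
}
```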

C# Equivalents

C# has no direct support for bounded or unbounded wildcards, and thus doesn't permit declaring class- or method-level variables that make use of them. However, if you can make the class/method itself generic, you can create equivalent functionality.

A Java method taking an unbounded wildcard would be mapped to a generic C# method with one generic type parameter for each unbound variable within the Java method:

static void PrintAll<T> (IEnumerable<T> list) {
    foreach (T t in list) {
        Console.WriteLine (t);
    }
}

This permits working with any type of IEnumerable<T>, e.g. List<int> and List<string>.

A Java method taking an upper bounded wildcard can be mapped to a generic C# method with one generic type parameter for each bound variable, then using a derivation constraint on the type parameter:

static void Cat<T> (IList<T> sources)
    where T : Stream
{
    foreach (Stream s in sources) {
        int c;
        while ((c = s.ReadByte ()) != -1)
            Console.Write ((char) c);
    }
}

A Java method taking a lower bounded wildcard can be mapped to a generic C# method taking two generic type parameters for each bound variable (one is the actual type you care about, and the other is the super type), then using a derivation constraint between your type variables:

static void Copy<T,U> (IEnumerable<T> source, 
        ICollection<U> dest)
    where T : U
{
    foreach (T t in source)
        dest.Add (t);
}

Generics Implementation

How Java and C# implement generics has a significant impact on what generics code can do and what can be done at runtime.

Java Implementation

Java Generics were originally designed so that the .class file format wouldn't need to be changed. This would have meant that Generics-using code could run unchanged on JDK 1.4.0 and earlier JDK versions.

However, the .class file format had to change anyway (for example, generics permit you to overload methods based solely on return type), but the design of Java Generics wasn't revisited, so Java Generics remains a compile-time feature based on Type Erasure.

Sadly, you need to know what type erasure is in order to actually write much generics code.

With Type Erasure, the compiler transforms your code in the following manner:

  1. Generic types (classes and interfaces) retain the same name, so you cannot have a non-generic class Foo and a generic Foo<T> in the same package -- these are the same type. This is the raw type.
  2. All instances of generic types become their corresponding raw type. So a List<String> becomes a List. (Thus all "nested" uses of generic type parameters -- in which the generic type parameter is used as a generic type argument of another generic type -- are "erased".)
  3. All instances of generic type parameters in both class and method scope become instances of their closest matching type:
    • If the generic type parameter has an extends constraint, then instances of the generic type parameter become instances of the specified type.
    • Otherwise, java.lang.Object is used.
  4. Generic methods also retain the same name, and thus there cannot be any overloading of methods between those using generic type parameters (after the above translations have occurred) and methods not using generic type parameters (see below for example).
  5. Runtime casts are inserted by the compiler to ensure that the runtime types are what you think they are. This means that there is runtime casting that you cannot see (the compiler inserts the casts), and thus generics confer no performance benefit over non-generics code.

For example, the following generics class:

class GenericClass<T, U extends Number> {
    T tMember;
    U uMember;

    public T getFirst (List<T> list) {
        return list.get (0);
    }

    // in bytecode, this is overloading based on return type
    public U getFirst (List<U> list) {
        return list.get (0);
    }

    //
    // This would be an error -- doesn't use generic type parameters
    // and has same raw argument list as above two methods:
    // 
    //  public Object getFirst (List list) {
    //      return list.get (0);
    //  }
    //

    public void printAll (List<U> list) {
        for (U u : list) {
            System.out.println (u);
        }
    }
}

Is translated by the compiler into the equivalent Java type:

class GenericClass {
    Object tMember;
    Number uMember; // as `U extends Number'

    public Object getFirst (List list) {
        return list.get (0);
    }

    public Number getFirst (List list) {
        // note cast inserted by compiler
        return (Number) list.get (0);
    }

    public void printAll (List list) {
        for (Iterator i=list.iterator (); i.hasNext (); ) {
            // note cast inserted by compiler
            Number u = (Number) i.next ();
            System.out.println (u);
        }
    }
}
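The effect of erasure can also be observed directly at runtime. In this sketch (the ErasureDemo name is made up), two differently constructed generic types turn out to share one runtime class:

```java
import java.util.ArrayList;
import java.util.List;

class ErasureDemo {
    static boolean sameRuntimeClass () {
        List<String> strings = new ArrayList<String> ();
        List<Integer> ints = new ArrayList<Integer> ();
        // Both objects are plain ArrayList instances at runtime.
        return strings.getClass () == ints.getClass ();
    }

    public static void main (String[] args) {
        System.out.println (sameRuntimeClass ());  // prints "true"
        System.out.println (
            new ArrayList<String> ().getClass ().getName ());
            // prints "java.util.ArrayList"
    }
}
```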

.NET Implementation

.NET adds a number of new instructions to its intermediate language to support generics. Consequently generics code cannot be directly used by languages that do not understand generics, though many generic .NET types also implement the older non-generic interfaces so that non-generic languages can still use generic types, if not directly.

The extension of IL to support generics permits type-specific code generation. Generic types and methods can be constructed over both reference (classes, delegates, interfaces) and value types (structs, enumerations).

Under .NET, there will be only one "instantiation" (JIT-time code generation) of a class which will be used for all reference types. (This can be done because (1) all reference types have the same representation as local variables/class fields, a pointer, and (2) generics code has a different calling convention in which additional arguments are implicitly passed to methods to permit runtime type operations.) Consequently, a List<string> and a List<object> will share JIT code.

No additional implicit casting is necessary for this code sharing, as the IL verifier will prevent violation of the type system. It is not Java Type Erasure.

Value types will always get a new JIT-time instantiation, as the sizes of value types will differ. Consequently, List<int> and List<short> will not share JIT code.

Currently, Mono will always generate new instantiations for generic types, for both value and reference types (i.e. JIT code is never shared). This may change in the future, and will have no impact on source code/IL. (It does impact runtime performance, as the unshared instantiations use more memory.)

However, there are still some translations performed by the compiler. These translations have been standardized for Common Language Specification (CLS) use, though these specific changes are not required.

Runtime Environment

Generics implementations may have some additional runtime support.

Java Runtime Environment

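Java generics are a completely compile-time construct. You cannot do anything with generic type parameters that relies in any way on runtime information. This includes creating instances of a generic type parameter, creating arrays of a generic type parameter, querying the runtime class of a generic type parameter, and using instanceof with a constructed generic type.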

In short, all of the following produce compiler errors in Java:

static <T> void genericMethod (T t) {
    T newInstance = new T (); // error: type creation
    T[] array = new T [0];    // error: array creation

    Class c = T.class;        // error: Class querying

    List<T> list = new ArrayList<T> ();
    if (list instanceof List<String>) {}
        // error: illegal generic type for instanceof
}
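A common way around the instance-creation restriction is to have the caller pass a Class<T> object explicitly, so that the runtime type information travels as an ordinary argument. The following is only a sketch of that idiom; the TypeTokenDemo class and its create method are hypothetical names, not part of any library:

```java
class TypeTokenDemo {
    // `new T ()` is impossible, so the caller supplies Class<T>.
    static <T> T create (Class<T> type) {
        try {
            return type.getDeclaredConstructor ().newInstance ();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException (
                type + " needs an accessible no-argument constructor", e);
        }
    }

    public static void main (String[] args) {
        StringBuilder sb = create (StringBuilder.class);
        sb.append ("created via reflection");
        System.out.println (sb);
    }
}
```

The cost is that the check for a usable constructor moves from compile time to runtime: create (Integer.class) compiles but throws, since Integer has no no-argument constructor.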

Array Usage

The above has some interesting implications on your code. For example, how would you create your own type-safe collection (i.e. how is ArrayList<T> implemented)?

By accepting the unchecked warning -- you cannot remove the warning. Fortunately you'll only see the warning when you compile your class, and users of your class won't see the unchecked warnings within your code. There are two ways to do it, the horribly unsafe way and the safe way.

The horribly unsafe way works for simple cases:

static <T> T[] unsafeCreateArray (T type, int size) {
    return (T[]) new Object [size];
}

This seems to work for typical generics code:

static <T> void seemsToWork (T t) {
    T[] array = unsafeCreateArray (t, 10);
    array [0] = t;
}

But it fails horribly if you ever need to use a non-generic type:

static void failsHorribly () {
    String[] array = unsafeCreateArray ((String) null, 10);
    // runtime error: ClassCastException
}

The above works if you can guarantee that the created array will never be cast to a non-Object array type, so it's useful in some limited contexts (e.g. implementing java.util.ArrayList), but that's the extent of it.

If you need to create the actual runtime array type, you need to use java.lang.reflect.Array.newInstance and java.lang.Class<T>:

static <T> T[] safeCreateArray (Class<T> c, int size) {
    return (T[])java.lang.reflect.Array.newInstance(c,size);
}

static void actuallyWorks () {
    String[] a1 = safeCreateArray(String.class, 10);
}

Note that this still generates a warning by the compiler, but no runtime exception will occur.
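The two approaches can be compared in one runnable sketch. The method bodies follow the fragments above; the ArrayCreationDemo class, the unsafeFails helper, and the @SuppressWarnings annotations are additions for illustration (the annotation only silences the unchecked warning, it does not make the cast any safer):

```java
import java.lang.reflect.Array;

class ArrayCreationDemo {
    @SuppressWarnings ("unchecked")
    static <T> T[] unsafeCreateArray (T type, int size) {
        return (T[]) new Object [size];   // really an Object[]
    }

    @SuppressWarnings ("unchecked")
    static <T> T[] safeCreateArray (Class<T> c, int size) {
        return (T[]) Array.newInstance (c, size);   // really a T[]
    }

    static boolean unsafeFails () {
        try {
            String[] array = unsafeCreateArray ((String) null, 10);
            return false;   // not reached: the assignment above throws
        } catch (ClassCastException e) {
            return true;    // an Object[] is not a String[]
        }
    }

    public static void main (String[] args) {
        System.out.println (unsafeFails ());   // prints "true"
        String[] ok = safeCreateArray (String.class, 10);
        System.out.println (ok.getClass () == String[].class);
            // prints "true"
    }
}
```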

C# Runtime Environment

.NET provides extensive runtime support for generics code, permitting you to do everything that Java doesn't:

static void GenericMethod<T> (T t)
    where T : new()
{
    T newInstance = new T (); // OK - new() constraint.
    T[] array = new T [0];    // OK

    Type type = typeof(T);    // OK

    List<T> list = new List<T> ();
    if (list is List<String>) {} // OK
}

C# also has extensive support for querying generic information at runtime via System.Reflection, such as with System.Type.GetGenericArguments().

What C# constraints cannot express is a requirement for a non-default constructor, for a method that isn't declared on an interface or base type, or for a static method. Since operator overloading is based on static methods, this means that you cannot generically use arithmetic unless you introduce your own interface to perform arithmetic:

// This is what I'd like:
class Desirable // NOT C#
{
    public static T Add<T> (T a, T b)
        where T : .op_Addition(T,T)
    {
        return a + b;
    }
}

// And this is what we currently need to do:
interface IArithmeticOperations<T> {
    T Add (T a, T b);
    // ...
}

class Undesirable {
    public static T Add<T> (IArithmeticOperations<T> ops, 
        T a, T b)
    {
        return ops.Add (a, b);
    }
}

Summary

The generics capabilities in Java and .NET differ significantly. Syntax-wise, Java and C# generics initially look quite similar, and share similar concepts such as constraints. The semantics of generics is where they differ most, with .NET permitting full runtime introspection of generic types and generic type parameters in ways that are obvious in their utility (instance creation, array creation, performance benefits for value types due to lack of boxing) and completely lacking in Java.

In short, all that Java generics permit is greater type safety with no new capabilities, with an implementation that permits blatant violation of the type system with nothing more than warnings:

List<String> stringList = new ArrayList<String> ();
List rawList            = stringList;
    // only triggers a warning
List<Object> objectList = rawList;
    // only triggers a warning
objectList.add (new Object ());
for (String s : stringList)
    System.out.println (s); 
        // runtime error: ClassCastException due to Object.

This leads to the recommendation that you remove all warnings from your code, but if you try to do anything non-trivial (apparently typesafe arrays are non-trivial), you get into scenarios where you cannot remove all warnings.

Contrast this with C#/.NET, where the above code isn't possible: there are no raw types, and the compiler rejects a direct conversion from List<string> to List<object>. Getting past the compiler would (1) require explicit casts by way of object (as opposed to the complete lack of casts in the above Java code), and (2) generate an InvalidCastException at runtime from the explicit cast.

Furthermore, C#/.NET confers additional performance benefits: no casts are required (as the verifier ensures everything is kosher), and generics support value types (Java generics don't work with the builtin types like int), which removes the overhead of boxing. The result is that C# permits faster, more elegant, more understandable, and more maintainable code.

Posted on 31 Aug 2007 | Path: /development/ | Permalink

Problems with Traditional Object Oriented Ideas

I've been training with Michael Meeks, and he gave Hubert and me an overview of the history of OpenOffice.org.

One of the more notable comments concerned the binfilter module, which is a stripped-down copy of StarOffice 5.2 (so if you build it you wind up with an ancient version of StarOffice embedded within your current OpenOffice.org build).

Why is an embedded StarOffice required? Because of misinformed "traditional" Object Oriented practice. :-)

Frequently in program design, you'll need to save state to disk and read it back again. Sometimes this needs to be done manually, and sometimes you have a framework to help you (such as .NET Serialization). Normally, you design the individual classes to read/write themselves to external storage. This has lots of nice benefits, such as better encapsulation (the class doesn't need to expose its internals), the serialization logic is in the class itself "where it belongs," etc. It's all good.

Except it isn't. By tying the serialization logic to your internal data structures, you severely reduce your ability to change your internal data structures for optimization, maintenance, etc.

Which is why OpenOffice.org needs to embed StarOffice 5.2: the StarOffice 5.2 format serialized internal data structures, but as time went on they wanted to change the internal structure for a variety of reasons. The result: they couldn't easily read or write their older storage format without having a copy of the version of StarOffice that generated that format.

The takeaway from this is that if you expect your software to change in any significant way (and why shouldn't you?), then you should aim to keep your internal data structures as far away from your serialization format as possible. This may complicate things, or it may require "duplicating" code (e.g. your real data structure, and then a [Serializable] version of the "same" class -- with the data members but not the non-serialization logic -- to be used when actually saving your state), but failure to do so may complicate future maintenance.
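As a rough sketch of that "duplicated" design, here it is in Java rather than .NET. Every name in it (Note, NoteRecordV1, roundTrip) is hypothetical, and Java's built-in serialization merely stands in for whatever format you actually use. Only the plain record class is ever written out, so the in-memory class stays free to change:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class SerializationDemo {
    // Hypothetical storage format: data members only, versioned by name.
    static class NoteRecordV1 implements Serializable {
        private static final long serialVersionUID = 1L;
        String title;
        String[] lines;
    }

    // Hypothetical in-memory model: its internals are never serialized.
    static class Note {
        private final String title;
        private final List<String> lines = new ArrayList<String> ();

        Note (String title) { this.title = title; }
        void addLine (String line) { lines.add (line); }
        String getTitle () { return title; }
        List<String> getLines () { return lines; }

        NoteRecordV1 toRecord () {
            NoteRecordV1 r = new NoteRecordV1 ();
            r.title = title;
            r.lines = lines.toArray (new String [0]);
            return r;
        }

        static Note fromRecord (NoteRecordV1 r) {
            Note n = new Note (r.title);
            for (String line : r.lines)
                n.addLine (line);
            return n;
        }
    }

    static Note roundTrip (Note note)
        throws IOException, ClassNotFoundException
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream ();
        ObjectOutputStream out = new ObjectOutputStream (bytes);
        out.writeObject (note.toRecord ());   // only the record is written
        out.close ();

        ObjectInputStream in = new ObjectInputStream (
            new ByteArrayInputStream (bytes.toByteArray ()));
        NoteRecordV1 record = (NoteRecordV1) in.readObject ();
        in.close ();
        return Note.fromRecord (record);
    }

    public static void main (String[] args) throws Exception {
        Note note = new Note ("todo");
        note.addLine ("write docs");
        Note copy = roundTrip (note);
        System.out.println (copy.getTitle () + ": " + copy.getLines ());
    }
}
```

If Note later switches from a List to some other structure, only toRecord and fromRecord need to change; files written as NoteRecordV1 remain readable.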

(Which is why Advanced .NET Remoting suggests thinking about serialization formats before you publish your first version...)

Posted on 28 Aug 2007 | Path: /development/openoffice.org/ | Permalink

Re-Introducing monodocer

In the beginning... Mono was without documentation. Who needed it when Microsoft had freely available documentation online? (That's one of the nice things about re-implementing -- and trying to stay compatible with -- a pre-existing project: reduced documentation requirements. If you know C# under .NET, you can use C# under Mono, by and large, so just take an existing C# book and go on your way...)

That's not an ideal solution, as MSDN is/was slow. Very slow. Many seconds to load a single page slow. (And if you've ever read the .NET documentation on MSDN where it takes many page views just to get what you're after... You might forget what you're looking for before you find it.) A local documentation browser is useful.

Fortunately, the ECMA 335 standard comes to the rescue (somewhat): it includes documentation for the types and methods which were standardized under ECMA, and this documentation is freely available and re-usable.

The ECMA documentation consists of a single XML file (currently 7.2MB) containing all types and type members. This wasn't an ideal format for writing new documentation, so the file was split up into per-type files; this is what makes up the monodoc svn module (along with many documentation improvements since, particularly for types and members that are not part of the ECMA standard).

However, this ECMA documentation import was last done many years ago, and the ECMA documentation has improved since then. (In particular, it now includes documentation for many types/members added in .NET 2.0.) We had no tools to import any updates.

Monodocer

Shortly after the ECMA documentation was originally split up into per-type files, Mono needed a way to generate documentation stubs for non-ECMA types within both .NET and Mono-specific assemblies. This was (apparently) updater.exe.

Eventually, Joshua Tauberer created monodocer, which both creates ECMA-style documentation stubs (in one file/type format) and can update documentation based on changes to an assembly (e.g. add a new type/member to an assembly and the documentation is updated to mention that new type/member).

By 2006, monodocer had (more-or-less) become the standard for generating and updating ECMA-style documentation, so when I needed to write Mono.Fuse documentation I used monodocer...and found it somewhat lacking in support for Generics. Thus begins my work on improving monodocer.

monodocer -importecmadoc

Fast-forward to earlier this year. Once monodocer could support generics, we could generate stubs for all .NET 2.0 types. Furthermore, ECMA had updated documentation for many core .NET 2.0 types, so...what would it take to get ECMA documentation re-imported?

This turned out to be fairly easy, with support added in mid-May to import ECMA documentation via a -importecmadoc:FILENAME parameter. The problem was that this initial version was slow; quoting the ChangeLog, "WARNING: import is currently SLOW." How slow? ~4 Minutes to import documentation for System.Array.

This might not be too bad, except that there are 331 types in the ECMA documentation file, documenting 3797 members (fields, properties, events, methods, constructors, etc.). 4 minutes per type is phenomenally slow.

Optimizing monodocer -importecmadoc

Why was it so slow? -importecmadoc support was originally modeled after -importslashdoc support, which is as follows: lookup every type and member in System.Reflection order, create an XPath expression for this member, and execute an XPath query against the documentation we're importing. If we get a match, import the found node.

The slowdown was twofold: (1) we loaded the entire ECMA documentation into an XmlDocument instance (XmlDocument is a DOM interface, and thus copies the entire file into memory), and (2) we were then accessing the XmlDocument randomly.

The first optimization is purely algorithmic: don't import documentation in System.Reflection order, import it in ECMA documentation order. This way, we read the ECMA documentation in a single pass, instead of randomly.

As is usually the case, algorithmic optimizations are the best kind: it cut down the single-type import from ~4 minutes to less than 20 seconds.

I felt that this was still too slow, as 20s * 331 types is nearly 2 hours for an import. (This is actually faulty reasoning, as much of that 20s time was to load the XmlDocument in the first place, which is paid for only once, not for each type.) So I set out to improve things further.

First was to use an XPathDocument to read the ECMA documentation. Since I wasn't editing the document, I didn't really need the DOM interface that XmlDocument provides, and some cursory tests showed that XPathDocument was much faster than XmlDocument for parsing the ECMA documentation (about twice as fast). This improved things, cutting single-type documentation import from ~15-20s to ~10-12s. Not great, but better.

Convinced that this still wasn't fast enough, I went to the only faster XML parser within .NET: XmlTextReader, which is a pull-parser lacking any XPath support. This got a single-file import down to ~7-8s.
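To illustrate what a single forward pass with a pull parser looks like, here is a sketch using Java's StAX API, which plays a role similar to .NET's XmlTextReader. This is not monodocer's code, and the element and attribute names (Types, Type, Name) are simplifying assumptions about the documentation file:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SinglePassImport {
    // Visits each <Type> element once, in document order, without
    // building an in-memory tree of the whole file.
    static List<String> typeNames (String xml) throws XMLStreamException {
        List<String> names = new ArrayList<String> ();
        XMLStreamReader reader = XMLInputFactory.newInstance ()
            .createXMLStreamReader (new StringReader (xml));
        while (reader.hasNext ()) {
            if (reader.next () == XMLStreamConstants.START_ELEMENT
                    && "Type".equals (reader.getLocalName ())) {
                names.add (reader.getAttributeValue (null, "Name"));
            }
        }
        reader.close ();
        return names;
    }

    public static void main (String[] args) throws XMLStreamException {
        String xml =
            "<Types>" +
            "<Type Name='System.Array'><Members/></Type>" +
            "<Type Name='System.String'><Members/></Type>" +
            "</Types>";
        System.out.println (typeNames (xml));
            // prints "[System.Array, System.String]"
    }
}
```

The trade-off is the one described above: there is no XPath and no random access, so the import logic has to follow the order of the file rather than the order of System.Reflection.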

I feared that this would still need ~45 minutes to import, but I was running out of ideas so I ran a full documentation import for mscorlib.dll to see what the actual runtime was. Result: ~2.5 minutes to import ECMA documentation for all types within mscorlib.dll. (Obviously the ~45 minute estimate was a little off. ;-)

Conclusion

Does this mean that we'll have full ECMA documentation imported for the next Mono release? Probably not. There are still a few issues with the documentation import where it skips members that ideally would be imported (for instance, documentation for System.Security.Permissions.FileIOPermissionAttribute.All isn't imported because Mono provides a get accessor while ECMA doesn't). The documentation also needs to be reviewed after import to ensure that the import was successful (a number of bugs have been found and fixed while working on these optimizations).

Hopefully it won't take me too long to get things imported...

Posted on 15 Jul 2007 | Path: /development/mono/ | Permalink

Mono.Fuse 0.4.2

Mono.Fuse is a C# binding for FUSE. This is a minor update over the previous Mono.Fuse 0.4.1 release, made to fix configure support.

Aside: A Walk through Mono.Posix History

As mentioned in the Mono.Fuse 0.1.0 release, one of the side-goals was to make sure that Mono.Unix.Native was complete enough to be usable. One of the great discoveries was that it wasn't, which led to the addition of some new NativeConvert methods.

However, Mono.Fuse and the new Mono.Posix development were concurrent, and in the process of getting the new NativeConvert methods added, some of them were dropped. Originally, there would be 4 methods to convert between managed and native types:

This is what Mono.Fuse 0.2.1 and later releases assumed, and they used the NativeConvert.Copy methods.

Unfortunately, it was felt that having 4 methods/type (Stat, Statvfs, Utimbuf, Pollfd, Timeval...) would add a lot of new methods, so Mono.Posix only accepted the TryCopy variants, and not the Copy variants.

This implicitly broke Mono.Fuse, but I unfortunately didn't notice. The configure check only tested whether libMonoPosixHelper.so exported one of the required underlying copy functions, so most people didn't notice it either: the installed libMonoPosixHelper.so didn't have the exports, so the check always failed, causing Mono.Fuse to use its fallback methods.

Now that newer Mono releases are available, the configure check does find the libMonoPosixHelper.so exports, so it tries to use the NativeConvert methods...and triggers a compilation error, as the methods it's trying to use don't exist.

Mea culpa.

Download

Mono.Fuse 0.4.2 is available from http://www.jprl.com/Projects/mono-fuse/mono-fuse-0.4.2.tar.gz. It can be built with Mono 1.1.13 and later. Apple Mac OS X support has only been tested with Mono 1.2.3.1.

GIT Repository

The GIT repository for Mono.Fuse is at http://www.jprl.com/Projects/mono-fuse.git.

Posted on 29 Jun 2007 | Path: /development/mono.fuse/ | Permalink

POSIX Says The Darndest Things

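make check was reported to be failing earlier this week, and Mono.Posix was one of the problem areas (the Polish message in the trace, "Nie ma takiego pliku ani katalogu", means "No such file or directory"):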

1) MonoTests.Mono.Unix.UnixGroupTest.ListAllGroups_ToString : #TLAU_TS:
Exception listing local groups: System.IO.FileNotFoundException: Nie ma
takiego pliku ani katalogu ---> Mono.Unix.UnixIOException: Nie ma
takiego pliku ani katalogu [ENOENT].
  at Mono.Unix.UnixMarshal.ThrowExceptionForLastError () [0x00000] in
/home/koxta/mono-1.2.4/mcs/class/Mono.Posix/Mono.Unix/UnixMarshal.cs:456
  at Mono.Unix.UnixGroupInfo.GetLocalGroups () [0x0001c] in
/home/koxta/mono-1.2.4/mcs/class/Mono.Posix/Mono.Unix/UnixGroupInfo.cs:127
  at MonoTests.Mono.Unix.UnixGroupTest.ListAllGroups_ToString ()
[0x0000a] in
/home/koxta/mono-1.2.4/mcs/class/Mono.Posix/Test/Mono.Unix/UnixGroupTest.cs:32
  at MonoTests.Mono.Unix.UnixGroupTest.ListAllGroups_ToString ()
[0x0003c] in
/home/koxta/mono-1.2.4/mcs/class/Mono.Posix/Test/Mono.Unix/UnixGroupTest.cs:37
  at <0x00000> <unknown method>
  at (wrapper managed-to-native)
System.Reflection.MonoMethod:InternalInvoke (object,object[])
  at System.Reflection.MonoMethod.Invoke (System.Object obj,
BindingFlags invokeAttr, System.Reflection.Binder binder,
System.Object[] parameters, System.Globalization.CultureInfo culture)
[0x00040] in
/home/koxta/mono-1.2.4/mcs/class/corlib/System.Reflection/MonoMethod.cs:144

Further investigation narrowed things down to Mono_Posix_Syscall_setgrent() in support/grp.c:

int
Mono_Posix_Syscall_setgrent (void)
{
	errno = 0;
	setgrent ();
	return errno == 0 ? 0 : -1;
}

I did this because setgrent(3) can fail, even though it has a void return type; quoting the man page:

Upon error, errno may be set. If one wants to check errno after the call, it should be set to zero before the call.

Seems reasonably straightforward, no? Clear errno, do the function call, and if errno is set, an error occurred.

Except that this isn't true. On Gentoo and Debian, calling setgrent(3) may set errno to ENOENT (no such file or directory), because setgrent(3) tries to open the file /etc/default/nss. Consequently, Mono.Unix.UnixGroupInfo.GetLocalGroups reported an error (as can be seen in the above stack trace).
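To make the failure mode concrete, here is a minimal sketch of how the "clear errno, call, check errno" pattern produces a false positive. The function names are hypothetical: fake_call_with_stray_errno stands in for a void libc call that succeeds but leaves ENOENT behind after an internal open() fails, and naive_call mirrors the wrapper shown earlier.

```c
#include <errno.h>

/* Hypothetical stand-in for a void libc call such as setgrent(3):
 * the call itself succeeds, but probing an optional file leaves
 * ENOENT in errno as a side effect. */
void fake_call_with_stray_errno(void)
{
	errno = ENOENT;
}

/* Hypothetical stand-in for a void call that never touches errno. */
void fake_quiet_call(void)
{
}

/* The naive pattern: clear errno, make the call, and treat any
 * nonzero errno afterwards as a failure. */
int naive_call(void (*fn)(void))
{
	errno = 0;
	fn();
	return errno == 0 ? 0 : -1;
}
```

With these definitions, naive_call(fake_quiet_call) returns 0, while naive_call(fake_call_with_stray_errno) returns -1 even though nothing actually went wrong.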

Further discussion with some Debian maintainers brought forth the following detail: It's only an error if it's a documented error. So even though setgrent(3) set errno, it wasn't an error because ENOENT isn't one of the documented error values for setgrent(3).

"WTF!" says I.

So I dutifully go off and fix it, so that only documented errors result in an error:

int
Mono_Posix_Syscall_setgrent (void)
{
	errno = 0;
	do {
		setgrent ();
	} while (errno == EINTR);
	mph_return_if_val_in_list5(errno, EIO, EMFILE, ENFILE, ENOMEM, ERANGE);
	return 0;
}
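The definition of mph_return_if_val_in_list5 isn't shown here. As a rough sketch only, assuming such a macro simply returns -1 from the enclosing function when the value matches one of the five listed constants, it could look like the following. RETURN_IF_VAL_IN_LIST5, call_reporting_documented_errors, and the two fake functions are hypothetical names used for illustration, not the MonoPosixHelper definitions.

```c
#include <errno.h>

/* Hypothetical helper, NOT the real mph_return_if_val_in_list5:
 * return -1 from the enclosing function if val equals any of the
 * five listed values. */
#define RETURN_IF_VAL_IN_LIST5(val, a, b, c, d, e) \
	do { \
		int v_ = (val); \
		if (v_ == (a) || v_ == (b) || v_ == (c) || \
		    v_ == (d) || v_ == (e)) \
			return -1; \
	} while (0)

/* Hypothetical void call that leaves an undocumented errno behind. */
void fake_sets_enoent(void)
{
	errno = ENOENT;
}

/* Hypothetical void call that fails with a documented errno. */
void fake_sets_eio(void)
{
	errno = EIO;
}

/* Call fn, retrying on EINTR, and report failure only for the
 * documented error values. */
int call_reporting_documented_errors(void (*fn)(void))
{
	do {
		errno = 0;	/* reset per attempt so a stale EINTR cannot re-loop */
		fn();
	} while (errno == EINTR);
	RETURN_IF_VAL_IN_LIST5(errno, EIO, EMFILE, ENFILE, ENOMEM, ERANGE);
	return 0;
}
```

Here call_reporting_documented_errors(fake_sets_enoent) returns 0, since ENOENT is not in the list, while call_reporting_documented_errors(fake_sets_eio) returns -1. One deliberate difference from the listing above: this sketch clears errno inside the loop rather than once before it, so a retry after EINTR starts from a clean slate.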

...and then I go through the rest of the MonoPosixHelper code looking for other such erroneous uses of errno and error reporting. Several POSIX functions with void return types are documented as generating no errors, while others, like setgrent(3), may generate one.

It's unfortunate that POSIX has void functions that can trigger an error. It makes binding POSIX more complicated than it should be.

Posted on 29 Jun 2007 | Path: /development/mono/ | Permalink

Mono.Fuse 0.4.1

Now with MacFUSE support!

Mono.Fuse is a C# binding for FUSE. This is a minor update over the previous Mono.Fuse 0.4.0 release.

The highlight for this release is cursory MacFUSE support, which allows Mono.Fuse to work on Mac OS X. Unfortunately, it's not complete support, and I would appreciate any assistance in fixing the known issues (details below).

Mac OS X HOWTO

To use Mono.Fuse on Mac OS X, do the following:

  1. Download and install Mono 1.2.3.1 or later. Other releases can be found at the Mono Project Downloads Page.
  2. Download and install MacFUSE 0.