Hi Folks

A week or so ago I posted a message about a "random train" Excel

spreadsheet I had created, and gave three examples of its output

(message #75066). Briefly, the spreadsheet generates a list of 40

boxcars chosen at random from a universe of cars which approximates

the U.S. boxcar fleet ownership in 1949. Experienced Excel users can

adapt the spreadsheet to create random trains of any desired length

using whatever universe they would like.

In message #75229 I described how I automated the spreadsheet by

running it 100,000 times. That is, it created 100,000 randomly

generated car lists, with 40 cars per list. The main purpose of this

simulation was to test whether the random train spreadsheet was

operating correctly; if it was, then over the long run, the average

proportions of the randomly generated cars should tend to the

proportions of the universe (they did). The simulation also recorded

the maximum number of cars generated during any of the 100,000

iterations (for each road). For example, the number of NYC cars in

the three lists in message #75066 was 4, 1 and 6. During the

simulation, there was at least one car list with 14 NYC cars. (In a

list of 40 cars that is proportional to 1949 national averages, one

should expect 4 from the NYC.)
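
For anyone who would rather script the check than automate Excel, here is a minimal Python sketch of the same kind of simulation. It is not the spreadsheet itself: the function name, the seed, and the 10% NYC share (inferred from "expect 4 in a list of 40") are assumptions made for illustration, and it runs only 20,000 lists to keep it quick.

```python
import random
from collections import Counter

def simulate_car_lists(p_road, cars_per_list=40, n_lists=100_000, seed=1949):
    """Tally how many cars of one road appear in each randomly generated list.

    p_road is the road's share of the universe (an assumed input here).
    Each car is drawn independently, i.e. sampling with replacement.
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_lists):
        count = sum(1 for _ in range(cars_per_list) if rng.random() < p_road)
        tally[count] += 1
    return tally

# Assumed NYC share of 10%, inferred from "expect 4 in a list of 40".
tally = simulate_car_lists(p_road=0.10, n_lists=20_000)
total_lists = sum(tally.values())
average = sum(k * v for k, v in tally.items()) / total_lists
print(f"average NYC cars per list: {average:.2f}")
print(f"maximum NYC cars in any list: {max(tally)}")
```

The average should settle near 4, while the maximum drifts upward as more lists are generated, which is why a single list with 14 NYC cars can turn up in 100,000 tries.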

Subsequently, I have refined the simulation so as to record the

entire distribution of cars for each road during a simulation of

100,000 car lists. For example, the national proportion of New Haven

boxcars was less than 1% in 1949; most random trains of 40 cars would

not have any NH cars, but sometimes there will be one or more. The

next list shows the frequency distribution of 0, 1, 2, … NH cars

generated by the simulation of 100,000 car lists (71,508 car lists

had 0 NH cars; 24,068 had 1 NH car, etc.):

0___71,508

1___24,068

2___3,963

3___421

4___37

5___3

These numbers can be converted to probabilities by dividing by

100,000. Thus the probability of a car list with 40 cars and none

from the NH is .715, 1 car = .241, 2 cars = .040, etc.
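
The conversion is simple enough to do by hand, but a short Python sketch makes it easy to repeat for other roads. The counts are the ones posted above; the variable names are just illustrative.

```python
# Simulation counts for NH cars, copied from the list above.
nh_counts = {0: 71_508, 1: 24_068, 2: 3_963, 3: 421, 4: 37, 5: 3}

total = sum(nh_counts.values())  # should equal the 100,000 car lists
nh_probs = {cars: lists / total for cars, lists in nh_counts.items()}

for cars, prob in nh_probs.items():
    print(f"{cars} NH cars: {prob:.3f}")

# Average number of NH cars per 40-car list implied by the counts.
mean_cars = sum(cars * prob for cars, prob in nh_probs.items())
print(f"mean NH cars per list: {mean_cars:.3f}")
```

The implied mean of about a third of a car per list is what one would expect from a road holding less than 1% of the fleet.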

After examining the results of this simulation, it seemed to me that

the process of random car selection was much like the ball and urn

models I had learned about in my statistics classes umpteen years

ago: An urn has some red and white balls of a known proportion.

Reach in and grab a ball; if red, then record it as a "success", and

if white as a "failure"; replace the ball then repeat the process for

a certain number of times, say 40. What is the probability of 0

successes? Exactly 1 success? Exactly 2, 3, … ? These

probabilities are given by the binomial distribution. The next list

shows the binomial distribution for 0, 1, 2, … (multiplied by

100,000) for 40 trials and a "probability of success on each

trial" of .0084 = .84% (this is the national proportion of NH boxcars in

1949 that I used for my simulation).

0___71,483

1___24,098

2___3,960

3___423

4___33

5___2

Note the close correspondence of the simulation and the binomial

distributions in the two lists. This and the examination of other

simulation results convinced me that my process of random car

selection could be effectively modeled by the binomial distribution

(I also compared the Poisson distribution). If anyone would like a

copy of my simulation results, contact me off list.
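
The comparison can be reproduced with a few lines of Python. This is a sketch under one assumption worth flagging: the listed binomial values seem to correspond to the unrounded ratio 6,012 / 719,349 rather than the rounded .0084, so that is what is used here. The Poisson column uses a mean of 40 times that ratio, which is also an assumption about how the Poisson comparison would be set up.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of exactly k events with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

n = 40                    # boxcars per list
p = 6_012 / 719_349       # NH share of the 1949 fleet, about .84%
simulated = {0: 71_508, 1: 24_068, 2: 3_963, 3: 421, 4: 37, 5: 3}

print("NH  simulated  binomial  Poisson")
for k, sim in simulated.items():
    binom = 100_000 * binomial_pmf(k, n, p)
    pois = 100_000 * poisson_pmf(k, n * p)
    print(f"{k:>2}  {sim:>9,}  {binom:>8,.0f}  {pois:>7,.0f}")
```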

To use the binomial distribution, all you need to specify is the

number of trials (read: boxcars in a train) and the probability of

success on each trial (read: proportion of cars of a particular

ownership or type). The proportions of cars can be national,

regional, or any other proportion you wish to use. You can make the

calculations with the aid of tables, programs such as Excel, or any

of several on-line calculators.

I should point out a key difference between my simulation model and

the real world: Just like cards, a train "has memory". This means

that once a car is removed from the population and placed in the

train, it cannot be placed again in the same train. Once the first

NH car is chosen with a probability of success on each trial of

6,012 / 719,349 (NH boxcars divided by national boxcars, 1949) the

probability of success on each trial for the next one changes to

6,011 / 719,348. This is the difference between sampling with

replacement (my simulation) and sampling without replacement (real

world). The binomial distribution also assumes sampling with

replacement.
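
How much the distinction matters can be checked numerically. The sketch below compares the binomial (with replacement) against the hypergeometric distribution, which is the standard model for sampling without replacement; it is offered as an illustration rather than as part of the original simulation. The fleet figures are the 1949 numbers quoted above.

```python
from math import comb

N = 719_349   # national boxcars, 1949
K = 6_012     # NH boxcars, 1949
n = 40        # cars in the train

def binomial_pmf(k, n, p):
    """Sampling with replacement: every draw has the same probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def hypergeometric_pmf(k, N, K, n):
    """Sampling without replacement: k of the K special cars in n draws from N."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Probability of drawing an NH car first, then a second one right after it.
print(f"first draw:  {K / N:.7f}")
print(f"second draw: {(K - 1) / (N - 1):.7f}")

for k in range(4):
    with_repl = binomial_pmf(k, n, K / N)
    without_repl = hypergeometric_pmf(k, N, K, n)
    print(f"{k} NH cars: with {with_repl:.6f}, without {without_repl:.6f}")
```

With a fleet this large and a train this short, the two columns should agree to several decimal places, so the binomial remains a reasonable stand-in.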

One use for the binomial distribution is to test real world examples

for randomness. Again reaching back many years to my statistics

classes, I am reminded of the "null hypothesis": A researcher

discovers something interesting and suspects it is not merely

random. The null hypothesis is that it IS random, while the

alternative hypothesis is that it is not. The null hypothesis is

assumed to be true unless the researcher is 95% or 99% confident that

it is false (these are typical confidence levels).

The UP train with the large number of SP boxcars is an example. My

understanding is that this train had some 90 boxcars, 36 of which

were SP. In order to calculate the binomial distribution, we need to

know the number of "trials" (i.e., cars in the train, say 90) and

the "probability of success on each trial" (i.e., the proportion of

SP cars in the national fleet, say 4% = .04). From this you can find

the probability of a train with exactly 0, 1, 2, …, 36, … SP cars.

Or maybe not: It turns out that the probability of 36 or more cars

is so low that Excel cannot calculate it. For example, a 90-boxcar

train with more than a "mere" 20 SP boxcars would occur only once in

every 19.5 billion trains. Conclusion: This train could not have

occurred by chance alone. (A friend of mine who has lived in Laramie

all his life – in particular the 1940s and 50s – describes these cars

as a "transfer run".)
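
The tail probabilities can also be computed outside Excel. The Python sketch below sums the upper tail term by term instead of subtracting a cumulative probability from 1; that subtraction rounds to zero once the cumulative value is within about 1e-16 of 1, which may be why Excel gives up, though that is a guess. The threshold of 21 (more than 20 cars) is the one that appears to reproduce the 19.5 billion figure.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k_min, n, p):
    """P(X >= k_min), summed term by term so tiny values do not round to zero."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

n, p = 90, 0.04                      # 90 boxcars, 4% SP share
p_more_than_20 = upper_tail(21, n, p)
p_36_or_more = upper_tail(36, n, p)

print(f"more than 20 SP cars: one train in {1 / p_more_than_20:,.0f}")
print(f"36 or more SP cars: probability {p_36_or_more:.3e}")
```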

Suppose that the 4% number is wrong; Tim Gilbert's data list 4.9% SP-Pac

ownership in 1956. Let's be generous and make it 5%. Then a

90-car train would have more than 20 SP boxcars once in every 356 million

trains. (Tim's data are at "4060totalboxcarsUSownership.xls" in the

files section of this list.)

Rather than using the proportion of the national fleet, how about

giving more "weight" to SP cars on the UP because of the "connection"

between the two railroads, or because of nearness or whatever? Let's

say we "weight" the SP cars by a factor of two (Mike Brock suggests a

weight of 1.5). To apply the desired weight, multiply it by the

national proportion: e.g., 2 * 5% = 10%. Using a "probability of

success on each trial" of 10% and a 90-boxcar train, we find that

Excel still cannot calculate the probability of 36 or more SP cars

because it is too low (a train with more than 30 SP boxcars would

occur only once in every 3 billion trains). Conclusion: No reasonable weighting will reproduce

the train actually observed – we must reject the null hypothesis.

That is, the observed train composition is not the result of chance

alone.
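
The weighting arithmetic is easy to experiment with in a short script. The Python sketch below multiplies an assumed 5% national share by each weight and reports how rare a train with more than a given number of SP cars would be. The thresholds are illustrative choices, and "more than" (rather than "or more") is the reading that appears to match the 356 million and 3 billion figures.

```python
from math import comb

def upper_tail(k_min, n, p):
    """P(X >= k_min) for a binomial with n trials and success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

national_share = 0.05        # SP share, rounded up from 4.9%
train_length = 90

for weight, threshold in [(1.0, 20), (1.5, 30), (2.0, 30)]:
    p = weight * national_share          # weighted probability of success
    odds = 1 / upper_tail(threshold + 1, train_length, p)
    print(f"weight {weight}: p = {p:.3f}, more than {threshold} SP cars "
          f"in one train out of {odds:,.0f}")
```

Trying larger weights shows how far the proportion has to be pushed before 36 SP cars in 90 stops looking extraordinary.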

I suspect that if we begin applying the binomial distribution to real

world data we will find many cases in which we should reject the null

hypothesis of random car assignment. This does not imply that the

random assignment model should be ignored, of course; it simply means

that other factors (real world consists, photos, personal choice,

etc.) should also be considered. For example, we may find cases such

as transfer runs or large shippers where it makes sense to treat

blocks of cars as a unit and to assign these blocks, rather than the

individual cars, to trains.

Best wishes,

Larry Ostresh

Laramie, Wyoming