This simple trick will cut your Word document down to size.
Word documents can suffer from bloat. An author tried to submit a 32-page Word file to us. The formatting was straightforward, but the file was a whopping 2.4 MB. Our on-line systems rejected the file (because why would we want a file that big?). The author saved it as a Rich Text Format (RTF) file of 600 kB and uploaded it. We imported it into Word and saved it as a .docx and, lo and behold, it was back to nearly 3 MB: over 2 MB bigger without the addition of so much as a single comma. We reviewed the file and added some simple mark-up, and it blew up to well over 3 MB.
After that, I applied this trick to the script and got it down to less than 90 kB – well over an order of magnitude smaller. (And with five minutes’ work, it went to 80 kB.) So what’s going on?
The frank answer is that I’m not sure. I know some of the causes, but Word is a complex tool, so attributing anything to a single cause is dubious, and I’m trying to approach this as a user, not as a product tester. Broadly, there are two issues. Firstly, Word tracks changes. Even if you declare an edition to be final, and stop tracking changes, Word seems not to discard the change data. It’s still hanging around somewhere, even though it’s not used. Secondly, if your document is edited by multiple users (or one user on several computers), it picks up template information from each instance without discarding the previous information, so it keeps adding unused data to the document. (There may also be an issue with using different versions of Word to edit one document, and certainly further issues with editing documents in a mix of Word and other word-processors.)
The solution is to leave all the rubbish behind:-
Create a new blank document (preferably using a clean template that includes just the Styles you need). Now open your bloated document. Select all the contents (either with the mouse or with a shortcut: Ctrl-A on a Windows computer). Click copy (Ctrl-C). Switch to your new blank document and Paste (Ctrl-V). Then Save. The new version will have left most of the dross behind and kept your text, your formatting and not a lot else.
(There is a minor additional tweak: that process will copy over all the Styles from the source document, including ones that are not actually in use. You can reduce the file size a little more by deleting unused Styles.)
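If you want a rough idea of which Styles are candidates for deletion, the following is a minimal sketch rather than a definitive tool. It relies on the fact that a .docx file is a ZIP archive containing word/styles.xml and word/document.xml. The function name unused_style_ids is my own invention, and the check is deliberately naive: it ignores styles that are only used in headers, footers or footnotes, styles that other styles are based on, and default styles that Word applies without an explicit reference. Treat the result as a list of suspects, not a list of convictions.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements and attributes.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def unused_style_ids(docx_path):
    """Return style IDs defined in styles.xml but not referenced in document.xml.

    Hypothetical helper: only the main document body is examined.
    """
    with zipfile.ZipFile(docx_path) as archive:
        styles = ET.fromstring(archive.read("word/styles.xml"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # Every <w:style w:styleId="..."> is a defined style.
    defined = set()
    for style in styles.iter(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        if style_id is not None:
            defined.add(style_id)

    # Paragraph, character and table style references in the body.
    used = set()
    for tag in ("pStyle", "rStyle", "tblStyle"):
        for ref in body.iter(f"{{{W}}}{tag}"):
            used.add(ref.get(f"{{{W}}}val"))

    return sorted(defined - used)
```

Work on a copy of the document, and delete the suspects from within Word itself rather than by editing the XML.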
Postscript – What If That Wasn’t My Problem
The other major cause of Word Bloat is embedded images. If you need pictures, you need them, but consider cropping and shrinking to a size appropriate for your purpose before you embed images in your document.
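To find out whether images are your problem, you can look inside the file. A .docx is a ZIP archive, and embedded pictures normally sit in its word/media folder. The sketch below uses only the Python standard library to list the biggest parts of the archive. The function name largest_parts and the default of five entries are arbitrary choices of mine.

```python
import zipfile


def largest_parts(docx_path, top=5):
    """Return (name, uncompressed_bytes, compressed_bytes) for the biggest parts.

    Parts are ranked by the space they occupy inside the archive.
    """
    with zipfile.ZipFile(docx_path) as archive:
        infos = sorted(
            archive.infolist(),
            key=lambda info: info.compress_size,
            reverse=True,
        )
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in infos[:top]
        ]
```

If the top of the list is full of names beginning word/media/, the pictures are the culprit and the copy-and-paste trick will not help much.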
Once again, I find myself drawn to bad language. As usual, the cause is e-mail or, rather, e-mail filtering; a recent customer newsletter was rejected by a small number of (school) e-mail systems on the grounds of profanity. It is not my intention to write offensive newsletters (they are mainly about new publications), so the compilation strategy is to avoid swearing. In cases where words only have vulgar meanings, this is easy. It gets harder, as I have mentioned before, where words have multiple meanings dependent on context. Filtering is not good at context. I am returning to this topic because the offending word was an odd one. I think the cause of the problem was the title of David Pemberton’s Dance with the Devil. Why is the devil banned from my communications? The question is whether or not “devil” constitutes profanity.
That may seem obvious. You could argue that the devil, being in opposition to God, is, by definition, profane. However, that which is profane is not necessarily profanity. (Profane means ‘not sacred’ whereas profanity is swearing or other language that should be avoided in polite society.) It might also be argued that ‘devil’ is a religious concept: a personification of evil. But if you go to the source material, you will find relatively little about the devil in the Christian Bible – mainly the temptation of Christ (by Satan) as described in three of the gospels, and various instances of “casting out devils” (describing demonic possession). This should not be such a surprise: Christianity is monotheistic, believing in one omnipotent god; any elevation of the devil beyond the occasional anthropomorphic personification of evil would be to recreate a dualistic system along the lines of Manichaeism (which held that the universe was a perpetual struggle between equal opposing forces of good and evil).
So where do we get the notion of the devil as a consistent figure – the one with the horns and goat’s feet? Largely through a combination of later Christian mythologizing and mediaeval art. The former is a matter of joining biblical dots (notably from the books of Ezekiel, Isaiah and Revelation) to create a more coherent whole than appears in any of the sources. The latter is a matter of laziness. In Anna Karenina, when Tolstoy said “All happy families are alike; each unhappy family is unhappy in its own way” he was talking lazy rubbish. All happy families are different, but it is much easier – more dramatic – to describe the myriad ways people make each other miserable than it is to depict happiness. Similarly, depicting the tortures of hell and the attendant demons is far easier than a dull depiction of the tranquillity of heaven.
So what we have is the over-elaboration of a metaphor. Does that constitute profanity? I don’t think so. You can’t discuss the religious concept unless you name it. I suppose that there is an argument to be made that representations of the devil (such as the 16th century one by Jacob de Backer shown here) constitute profanity, but it’s a pretty abstruse argument. Then we have the original source of my problem: ‘dance with the devil’ is a metaphor, not a literal depiction or instruction. Old Harry appears in similar expressions like ‘devil in the detail’ and nobody takes those as literal or offensive. (At least, I don’t know of anybody who does. Would anyone care to speak, for example, for the Plymouth Brethren in this respect? I pick on them as a group who take such things very seriously and much more prescriptively than most of society.)
So are there any instances where use of ‘devil’ constitutes profanity? Well yes. You can call someone a devil offensively. You can also tell them to go to the devil. These days those uses constitute a vanishingly small minority when compared to legitimate religious use and common metaphor. So filtering out e-mails that contain the word devil is every bit as lazy as the mediaeval depictions of the tortures of hell.
We are under attack.
There is an increasing volume of spam, mostly aimed at business e-mail addresses, carrying a malicious payload via an attached file. The attachment contains some executable element (usually a macro that runs when the file is opened). The worst of the malicious payloads are ransomware – hijacking the computer and locking the user out pending payment of a ransom.
We have four lines of defence. The first is e-mail filtering. It isn’t very good.
I just completed my tax return on Her Majesty’s Revenue and Customs web site. At the end of the process, HMRC sent me a confirmation e-mail, essentially just giving me a reference number, with a link to the HMRC web site. That confirmation e-mail was filtered out as junk, whereas the filtering was perfectly happy to let through an e-mail with this header:
Or a similar one, in which HMRC appear to have contracted their services overseas:
Automated filtering suffers from both false positives and false negatives. The second line of defence is the user, who has to cope with messages like:
That e-mail address is more plausible than the HMRC spoofs but bears no relation to the person’s name or the supposed company. It is part of the bombardment of quasi-business e-mails, most of which have attachments disguised as financial instruments – invoices, statements of account and the like. The following is a better example; it spoofs a sender e-mail address consistently and the body of the e-mail takes the Ian Fleming approach, disguising the big lie in plausible levels of detail. (In this case, its biggest failing was that it was sent to a non-existent address and was therefore swept into our junk mail dungeon.)
In theory, there are two levels of security beyond the inbox that might still save us from the worst of the scams, but I never want to put those to the test – and there is something simple that business people can do to defeat the scammers.
The assumption made by the scammers is that the e-mail is coming into a busy financial office. The e-mail doesn’t contain enough information for the transaction to be recognisable and therefore the recipient will open the attachment to find out what it’s about. The e-mail is written as though there is a prior history, but that history is never specified.
All that is needed to defeat this – to prove that a business e-mail is genuine – is to have some common verifiable evidence of history in the body of the e-mail so that the provenance can be checked without opening the attachment.
So, if you send out e-mails with, for example, remittance advice notes attached, then make sure your subject line or e-mail body contains a verifiable reference to a purchase order or invoice number.
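On the receiving side, that check can be made without going anywhere near the attachment. The following is a minimal sketch using Python’s standard email library. The reference format (PO- or INV- followed by digits), the function name and the idea of a set of references taken from your own records are all assumptions of mine; substitute whatever your own numbering looks like.

```python
import re
from email.message import EmailMessage

# Hypothetical reference format: "PO-" or "INV-" followed by four or more digits.
REFERENCE_PATTERN = re.compile(r"\b(?:PO|INV)-\d{4,}\b")


def verifiable_references(message, known_references):
    """Return references in the subject or plain-text body that match our records.

    Attachments are never opened; only the subject and body text are read.
    """
    subject = message.get("Subject", "")
    body_part = message.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    found = set(REFERENCE_PATTERN.findall(f"{subject}\n{body}"))
    return found & set(known_references)
```

An empty result does not prove that a message is malicious, but it does mean there is nothing in it that ties it to a real transaction, which is a good reason to leave the attachment alone.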
(And why you can’t tell a Scottish head teacher that a child has been naughty.)
Paul Roostercroft came about through a collision of two problems. As mentioned previously (We Will Hide Your Stuff), BT Business has a novel filtering system that hides e-mails that it regards as spam. No customer notification – they don’t even tell you that this filter exists unless you ask the right question – just hiding. In theory – the theory expounded by the helpful BT second-line support guy who gave me access to the hidden system – this junk mail filter uses a learning algorithm. That means that if you tell it that something isn’t spam, it is supposed to look at future mail for similar characteristics, and, on that basis, decide that the new mail isn’t spam either. It doesn’t work. No matter how many times I tell it that I want to receive the regular bulletins from the Ordnance Survey (I like maps), it decides they are junk, whereas it lets through plenty of advertising e-mails to which I’ve never subscribed.
Similarly with Paul’s e-mail. Paul is a playwright whose e-mail I wish to receive. BT wishes to prevent that. The only reason I can see for BT’s objection is that he has the venerable Anglo-Saxon surname of Cockcroft. I assume that BT thinks that this name will offend my delicate sensibilities. No matter how many times I tell BT’s system that I want his e-mails, they still get trapped in the hidden junk folder.
That brings me on to the other problem (Things You Can’t Say). If BT thinks Cockcroft will frighten the horses, I can expect the same treatment from other e-mail systems. How am I supposed to talk about Paul’s plays in our e-mail newsletter? My solution was euphemism – specifically borrowing the American euphemism for a male chicken.
I thought that the inclusion of Paul Roostercroft had been successful in rendering my e-mails filter-proof until I received a “bounce” message that stated:
“A mail from you to [the head teacher of a Scottish primary school] was stopped and quarantined because it contains objectionable content in line 40”
I thought that this might have been caused by “Puss-in-Boots”, but no. As far as I can see from scrutinising the e-mail, the naughty word in line 40 was, in fact, “naughty”.
There are chat rooms and forums for British expatriates working in the United States (and elsewhere). They all include the question: “What’s the [local] equivalent of BACS?”
BACS is a brilliant idea, but it has yet to reach America. “Bankers’ Automated Clearing Services” allows direct transfer between UK bank accounts using the recipient’s account number and the branch identifier (sort code). It’s monodirectional – you push money from your account to someone else’s, but, even though you know the account number, you can’t pull money in the opposite direction. In its latest incarnation, it is generally very fast. For the banks, it’s cheaper to operate than cheques (the customer and the computer do all the work). For the customers it is (generally) more convenient and secure – I for one have never had an electron lost in the mail.
Are you waiting for something? Have you taken a breath in anticipation? Okay, here it comes. However…
The two identifier fields (account number and branch code) have a fixed format. They are a prescribed part of the protocol. There are also two text fields, one used for the benefit of the sender, to identify the recipient, the other for the benefit of the recipient, to identify (the reason for) the payment. Both fields are free-form text and both give hassle. This is at the nuisance level – the benefits far outweigh the niggles – but as a frequent user, I feel the frustration and the need to grizzle!
We would like our customers to use the second field to enter our order reference number. That number includes an underscore character, which is fine for some banks, but others block it. There are excellent reasons for “sanitising” customer input, and blocking some characters; however, I have never come across a good reason for blocking an underscore. (The “recipient” field also gets sanitised. My bank doesn’t like dots. It will cope with “Mr A Smith” but not “Mr A. Smith”.)
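I have no idea what rules any particular bank actually applies, but the behaviour described above is what you would expect from a simple whitelist of permitted characters. The sketch below is purely illustrative: the two character sets, the function name and the reference ORD_1234 are all made up for the purpose.

```python
import string

# Two hypothetical whitelists: one strict, one that also allows some punctuation.
STRICT = set(string.ascii_letters + string.digits + " ")
LENIENT = STRICT | set("_-./")


def rejected_characters(text, allowed):
    """Return the characters in text that the whitelist would block.

    Characters are listed in order of first appearance, without duplicates.
    """
    rejected = []
    for char in text:
        if char not in allowed and char not in rejected:
            rejected.append(char)
    return rejected


# The same order reference passes one bank's rules and fails another's.
print(rejected_characters("ORD_1234", LENIENT))    # []
print(rejected_characters("ORD_1234", STRICT))     # ['_']
print(rejected_characters("Mr A. Smith", STRICT))  # ['.']
```

Under these assumptions, a bank using the strict set would refuse both the underscore and the dot, which is why a reference number that works for one customer can be impossible for another to enter.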
Furthermore, some banks make it hard to change the “reason” field once it has been set up. Thus we get returning customers who appear to be paying for the same order multiple times.
The field that gives me the greatest problem is the “recipient” field. My bank encourages me to use that field to enter the recipient’s name – and logically that would be the name that appears on their bank account. However, the bank offers me a fixed length field that is insufficient for the purpose. I have a long list of authors who receive their royalties by BACS, but how can I be sure I’m paying the right person? If the field cuts off at 15 characters, how am I supposed to distinguish between Christopher McPherson and Christopher McPhee?
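The problem is easy to demonstrate. The sketch below groups a list of names by what survives a cut at 15 characters and reports any groups with more than one member. The function name is my own, and the 15-character width is simply the figure from the example above, not a statement about any particular bank.

```python
from collections import defaultdict


def truncation_collisions(names, width=15):
    """Group names that become identical when cut to a fixed width."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only the truncated forms shared by more than one full name.
    return {short: full for short, full in groups.items() if len(full) > 1}


payees = ["Christopher McPherson", "Christopher McPhee", "Mr A Smith"]
print(truncation_collisions(payees))
# {'Christopher McP': ['Christopher McPherson', 'Christopher McPhee']}
```

Running something like this over a payee list before setting up the payments at least tells you which names need an abbreviation that you choose yourself.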
This is a feature of Word 2007 and Word 2010, but not (pre-ribbon) Word 2002.
Try the following steps.
Start a new document in Word 2007 or Word 2010.
Write a short sentence or headline.
Select your text, then change the font to your favourite fancy font, increase the font size and make it italic.
Select the text, then click on the expander in the bottom right-hand corner of the Styles box on the Home tab of the ribbon. (That launches the pop-up Styles panel.)
At the bottom of the Styles panel, click on the New Style icon. This should create a new style from your fancy text, and prompt you to give it a name. Let’s call this style “Wanted”. Click OK to create it.
The name of your new style should now appear in the Styles panel.
From the “Options…” link at the bottom of the Styles panel, under the “Select Formatting to Show As Styles” heading, select “paragraph level formatting”. (That determines what shows up in your Styles panel.)
Now go back to the short sentence that you’ve created in your “Wanted” style. Put the cursor somewhere in the middle of that sentence and press Ctrl-Return. That inserts a page break.
Did you spot what that last operation did? In addition to the page break, it added something to your Styles panel.
What it added depends on which version of Word you’re using (and possibly the phase of the moon). In Word 2010 it usually adds a new style called “After: <something descriptive of paragraph formatting>”. In Word 2007 it adds a new style that describes details of the “Wanted” style.
Is this necessary?
No. To prove that, use the Styles panel to select all instances of the new (“Unwanted”) style and then apply the “Wanted” style to them. Aside from the demise of the Unwanted style, nothing else happens in the document. The Unwanted style was unnecessary.
Why does this matter?
Well, the point of Styles is to keep control of your document – to ensure that everything that should have the same format does have the same format. To ensure that if you want to change the way particular parts of the document look, you can change the style – one style, one change – and the change will be applied consistently throughout the document. By spewing out unnecessary styles, Microsoft makes it harder to format documents consistently.
Warning: this post contains words that are forbidden in Derby.
I sent an e-mail about a school play script to a customer at a school in Derby. I received an automated reply that said:-
Offensive Words Lexicon Found the expression “bottomless” 1 times, at 2 points each, for an expression score of 2 points.
Total Message Score: 2 points.
The e-mail has been blocked and has not been delivered.
Now, I recognise that in some contexts, the word bottomless can have connotations of immorality, but in this case, the context was the title of Raymond Blakesley’s school play “Santa Claus and the Bottomless Sack”. E-mail filtering systems are good with words, but very bad with context. Unfortunately, context is important. In describing a play to a school, I can’t say that the adult roles are written to be performed by children, as “adult” has been hijacked to mean “pornographic”. Instead, I have to use the childish expression “grown up”. Even worse, I can’t say that a play is written for teenagers as “teen” is blocked because it is used to mean “nubile” (though not in the sense of “marriageable”, unless marriageable is a euphemism).
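I don’t know how the Derby filter works internally, but the bounce message is consistent with a simple points-per-word lexicon. The following is a sketch of that idea, not a reconstruction of the real system. The lexicon contents beyond “bottomless”, the blocking threshold and the use of plain substring matching are all assumptions on my part.

```python
import re

# Assumed lexicon: only "bottomless" and its 2 points come from the bounce message.
LEXICON = {"bottomless": 2}
# Assumed threshold: the message scored 2 points and was blocked.
BLOCK_THRESHOLD = 2


def score_message(text, lexicon=LEXICON):
    """Return ({expression: (count, points)}, total) using substring matching.

    No attempt is made to consider context, which is the whole problem.
    """
    report = {}
    for expression, points in lexicon.items():
        pattern = re.escape(expression)
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            report[expression] = (count, count * points)
    total = sum(score for _, score in report.values())
    return report, total


def is_blocked(text, lexicon=LEXICON):
    """True if the total score reaches the assumed blocking threshold."""
    return score_message(text, lexicon)[1] >= BLOCK_THRESHOLD


print(score_message("Santa Claus and the Bottomless Sack"))
# ({'bottomless': (1, 2)}, 2)
```

With substring matching, a lexicon entry of “teen” would also fire on “teenagers”, which would explain why a perfectly innocent description of a play gets caught.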
The final insult from the automated message from Derby was the footnote. It said:
The views expressed in this email are personal and may not necessarily reflect those of Derby City Council
So the things I am not allowed to say are dictated by the personal opinions of an automaton.