F#: Word Count – A Somewhat Failed Attempt

by Mark Needham · Dec. 18, 09 · News

I came across Zach Cox's word count problem via Sam Aaron and Ola Bini's Twitter streams, and I thought it'd be interesting to try it out in F# to see what the solution would look like.

The solution needs to count word frequencies from a selection of newsgroup articles.

I wanted to see if it was possible to write it in F# without using a map to keep track of how many of each word had been found.

My thinking was that I would need to keep all of the words found and then calculate the totals at the end.

After a bit of fiddling this is the version I ended up with:

word-count.fsx

#light
open System
open System.IO
open System.Text.RegularExpressions

let (|File|Directory|) path = if Directory.Exists path then Directory(path) else File(path) // classify a path
let getFileSystemEntries path = Directory.GetFileSystemEntries path |> Array.toList

let files path =
    let rec inner fileSystemEntries files =
        match fileSystemEntries with
        | [] -> files
        | File path :: rest -> inner rest (path :: files)
        | Directory path :: rest -> inner (List.append rest (getFileSystemEntries path)) files
    inner (getFileSystemEntries path) []

let downloadFile path =
    use streamReader = new StreamReader(File.OpenRead path)
    streamReader.ReadToEnd()

let words input = Regex.Matches(input, @"\w+") |> Seq.cast |> Seq.map (fun (x:Match) -> x.Value.ToLower())

let wordCount = files >>
                List.map downloadFile >>
                List.map words >>
                List.fold (fun acc x -> Seq.append acc x) Seq.empty >>
                Seq.groupBy (fun x -> x) >>
                Seq.map (fun (value, sequence) -> (value, Seq.length sequence))

let writeTo (path:string) (values:seq<string * int>) =
    use writer = new StreamWriter(path)
    values |> Seq.iter (fun (value,count) -> writer.WriteLine(value + " " + count.ToString()))

let startTime = DateTime.Now
let count = wordCount "Z:\\20_newsgroups"

printfn "Writing counts in alphabetical order"
count |> Seq.sort |> writeTo "C:\\results\\counts-alphabetical-fsharp.txt"

printfn "Writing counts in descending order"
count |> Seq.sortBy (fun (_, count) -> -count) |> writeTo "C:\\results\\counts-descending-fsharp.txt"

let endTime = DateTime.Now
printfn "Finished in: %.1f seconds" (endTime - startTime).TotalSeconds // TotalSeconds, not the 0-59 Seconds component

The problem is that this version throws a StackOverflowException when I try to execute it with all the newsgroup articles, although it does work correctly if I select just one of the folders.

From what I can tell, the exception happens on line 24 (the 'List.map downloadFile' step), where I get the text out of each of the files and store it in the list.

I tried changing this bit of code by combining the 'words' and 'downloadFile' functions so that the whole string wouldn't be saved, but this didn't seem to help much. The exception just ended up happening a bit further down.

I'm not sure if it's possible to make this work by making use of lazy collections – I'm not that familiar with those yet so I'm not sure how to do it – or if this approach is just doomed!
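
For what it's worth, here is a rough sketch of what a lazier version might look like. It rests on an assumption that has not been verified against the full data set: that the overflow comes from folding 'Seq.append' over thousands of files, which builds a deeply nested chain of sequences that has to be walked recursively when enumerated. 'Seq.collect' flattens the per-file sequences without that nesting. The names 'allFiles', 'wordsOf' and 'wordCountLazy' are made up for this sketch, and 'Directory.EnumerateFiles' needs .NET 4.0 or later.

```fsharp
open System.IO
open System.Text.RegularExpressions

// Lazily enumerate every file under a root directory (hypothetical helper).
let allFiles (root: string) : seq<string> =
    Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)

// Lower-cased words of one piece of text, produced on demand.
let wordsOf (input: string) : seq<string> =
    Regex.Matches(input, @"\w+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value.ToLower())

// Seq.collect reads one file at a time and yields its words into a single
// flat sequence, so no left-nested chain of Seq.append calls is built.
let wordCountLazy (root: string) : seq<string * int> =
    allFiles root
    |> Seq.collect (fun path -> wordsOf (File.ReadAllText path))
    |> Seq.groupBy id
    |> Seq.map (fun (word, occurrences) -> (word, Seq.length occurrences))
```

Note that 'Seq.groupBy' still has to hold every word in memory before it can return anything, so this only addresses the nesting, not the overall memory use.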

Despite the fact it doesn't quite work there were some interesting things I noticed while playing with this problem:

  • At one stage I was trying to deal with a list of sequences whereby I had a list of sequences of words for each of the newsgroup articles. I found this really difficult to reason about as I was writing a 'List.map' and then a 'Seq.map' inside that.

    I originally had the 'List.fold (fun acc x -> Seq.append acc x) Seq.empty' line happening later on in that composition of functions such that I grouped all the words and then counted how many there were before folding down into a single sequence.

    I realised this didn't make much sense and it would be much easier to just go to the sequence earlier on and make the code easier to follow.

  • I've previously written about wrapping .NET library calls and I was doing this quite a lot when I started writing this code.

    For example, I had written a function called 'isDirectory' that wrapped 'Directory.Exists'. I wrote it more out of habit than anything else, and it doesn't really add much value. I think this is probably always the case when we're wrapping static methods. It's when we want to call a method on a C# object that the wrapping approach can be helpful.

  • I quite like the Ruby way of writing to a file…
    open("counts-descreasing-ruby", "w") do |out|
      counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
    end

    …so I thought I'd see what it would look like to change my 'writeTo' function to be more like that:

    let writeTo (path:string) (f: StreamWriter -> unit) = 
        use writer = new StreamWriter(path)
        f writer
    writeTo "C:\\results\\counts-alphabetical-fsharp.txt" (fun out -> 
        count |> Seq.sort |> Seq.iter (fun (v,c) -> out.WriteLine(v + " " + c.ToString())))

    I'm not sure it reads as well as the original version – the writing to file seems to become more prominent in this version than the data being written to it.
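
Going back to the point about wrapping .NET calls, this is a small illustration of the case where a wrapper seems to pay off: instance methods can't be dropped straight into a pipeline, whereas a thin function around them can. The 'toLower' and 'split' wrappers are invented for this example and are not part of the word count script.

```fsharp
// Hypothetical wrappers around instance methods on System.String.
let toLower (s: string) = s.ToLower()
let split (separator: char) (s: string) = s.Split([| separator |]) |> Array.toList

// With the wrappers in place the calls compose with |> like any other function.
let normalised = "Hello F# World" |> toLower |> split ' '
```

A static method like 'Directory.Exists' can already be used as a function value, which is why wrapping it gains so little.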

If anyone has any ideas about how I can get this not to blow up that would be cool!

These are some of the other solutions that I've come across:

  • Lau B. Jensen comments on Zach's post and provides an additional version in Clojure
  • A version in Ruby by Sam Aaron
  • A version in Ioke by Sam Aaron
  • Ola Bini's slightly modified version of Sam Aaron's Ioke version
  • A version by Shot in Ruby

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
