Author archive: Xianyang's Blog

About Xianyang's Blog

Hi, I'm Xianyang from Intel, a coder who follows big data technologies.

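Scala class serialization

First we create a Person class:

class Person(name: String, id: Int) extends Serializable

Here name and id behave like Java constructor parameters: they cannot be accessed directly through a Person object. Decompiling with javap gives something like:

public class Person implements Serializable {
     public Person(String name, int id) {
     }
}

So when a Person object is serialized, name and id are not written to the OutputStream.

However, when a method inside Person needs to access name or id, final fields are generated for them:

class Person(name: String, id: Int) extends Serializable {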
    def func(): Unit = {
       println(s"${name} ${id}")
    }
}

The decompiled code now looks like:

public class Person implements Serializable {
     private final String name;
     private final int id;
     public Person(String name, int id) {
        this.name = name;
        this.id = id;
     }

     public void func() {
        ...
     }
}

Now let's change Person's fields a little:

class Person(val name: String, val id: Int) extends Serializable

This time the decompiled code looks like:

public class Person implements Serializable {
    private final String _name;
    private final int _id;
    public Person(String name, int id) {
        this._name = name;
        this._id = id;
    }
    
    public String name() {
        return _name;
    }
    
    public int id() {
        return _id;
    }
}

The difference is that two accessor methods are added, and both fields are written out during serialization.
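One way to check which fields default Java serialization would write is to ask ObjectStreamClass. The sketch below uses hand-written Java stand-ins for the decompiled shapes shown above; the class names ParamsOnly and WithFields and the helper serializableFieldNames are hypothetical, and this is not the code scalac emits.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class SerializableFieldsDemo {
    // Stand-in for `class Person(name: String, id: Int)`: constructor parameters only, no fields.
    static class ParamsOnly implements Serializable {
        private static final long serialVersionUID = 1L;
        ParamsOnly(String name, int id) { }
    }

    // Stand-in for the variants where the parameters are kept as private final fields.
    static class WithFields implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;
        private final int id;
        WithFields(String name, int id) {
            this.name = name;
            this.id = id;
        }
    }

    // Names of the fields that default Java serialization would write for cls.
    static List<String> serializableFieldNames(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (ObjectStreamField field : ObjectStreamClass.lookup(cls).getFields()) {
            names.add(field.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(serializableFieldNames(ParamsOnly.class)); // no fields to write
        System.out.println(serializableFieldNames(WithFields.class)); // contains name and id
    }
}
```

The same helper can be pointed at a compiled Scala class to confirm which of its members actually became serializable fields.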

Another change:

class Person(var name: String, var id: Int) extends Serializable

Decompiled:

public class Person implements Serializable {
    private String _name;
    private int _id;
    public Person(String name, int id) {
        this._name = name;
        this._id = id;
    }
    
    public String name() {
        return _name;
    }
    
    public int id() {
        return _id;
    }
    
    // Scala's setter name_= is encoded as name_$eq in bytecode.
    public void name_$eq(String name) {
        this._name = name;
    }
    
    public void id_$eq(int id) {
        this._id = id;
    }
}

This time the two fields are no longer final, and two setter methods are added. It now looks much more like a Java POJO.

What happens with a lazy val? Let's modify the class again:

class Person(name: String, id: Int) extends Serializable {
    @transient lazy val NAME = name
    @transient lazy val ID = id
}

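Here we add two lazy fields, both marked transient.

import java.io._

val p = new Person("test", 1)

// Serialize p into a byte array, then deserialize it into a new Person.
val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(p); out.close()
val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
val p1 = in.readObject().asInstanceOf[Person]
println(p1.NAME) // "test"
println(p1.ID) // 1

We serialize and deserialize the Person object. Surprisingly, even though both fields are marked transient, their values are still available after deserialization. The decompiled code looks roughly like this: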

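public class Person implements Serializable {
    private final String _name;
    private final int _id;
    private transient String NAME;
    private transient int ID;
    // The initialization flags are transient too, so they reset after deserialization.
    private transient volatile boolean NAMEInitialized = false;
    private transient volatile boolean IDInitialized = false;
    public Person(String name, int id) {
        this._name = name;
        this._id = id;
    }
    
    public String NAME() {
        synchronized(this) {
            if (!NAMEInitialized) {
                NAME = _name;
                NAMEInitialized = true;
            }
        }
        
        return NAME;
    }
    
    public int ID() {
        synchronized(this) {
            if (!IDInitialized) {
                ID = _id;
                IDInitialized = true;
            }
        }
        
        return ID;
    }
}

The real generated code differs somewhat from this sketch. We can see that each lazy field is turned into a method. The transient modifier applies only to the internal cache field, and the method simply recomputes that cache from _name and _id, which are still serialized, so the value remains reachable through the method. To make the values unavailable after deserialization, Person only needs to be changed as follows: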

class Person(@transient name: String, @transient id: Int) extends Serializable {
    @transient lazy val NAME = name
    @transient lazy val ID = id
}
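
The effect can be reproduced in plain Java. The sketch below is a hand-written stand-in for the decompiled shape, not what scalac actually emits; the names Person, TransientPerson, lazyName and roundTrip are hypothetical and only the name field is modelled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class TransientLazyDemo {
    // The backing field is serialized; only the lazily filled cache is transient.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;
        private transient String cachedName;
        private transient boolean initialized;
        Person(String name) { this.name = name; }
        synchronized String lazyName() {
            if (!initialized) { cachedName = name; initialized = true; }
            return cachedName;
        }
    }

    // Same shape, but the backing field is transient as well.
    static class TransientPerson implements Serializable {
        private static final long serialVersionUID = 1L;
        private final transient String name;
        private transient String cachedName;
        private transient boolean initialized;
        TransientPerson(String name) { this.name = name; }
        synchronized String lazyName() {
            if (!initialized) { cachedName = name; initialized = true; }
            return cachedName;
        }
    }

    // Serializes value to a byte array and reads it back with default Java serialization.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The cache is rebuilt from the serialized backing field.
        System.out.println(roundTrip(new Person("test")).lazyName());          // test
        // Nothing was serialized, so there is nothing to rebuild the cache from.
        System.out.println(roundTrip(new TransientPerson("test")).lazyName()); // null
    }
}
```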

Notes on Kryo references


Kryo is a fast and efficient object graph serialization framework for Java. The goals of the project are speed, efficiency, and an easy to use API. The project is useful any time objects need to be persisted, whether to a file, database, or over the network.

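As described above, Kryo is an efficient serialization tool for Java; in some respects it serializes and deserializes Java objects faster than ProtoBuf (according to the wiki). It is widely used in Java frameworks, and Twitter Chill extends Kryo to Scala objects.

Serializing objects that reference each other

Serializing an object necessarily means serializing the objects it references. When objects reference each other, that is, objectA references objectB and objectB references objectA, the references form a cycle and serialization becomes somewhat more involved.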

By default, each appearance of an object in the graph after the first is stored as an integer ordinal. This allows multiple references to the same object and cyclic graphs to be serialized. This has a small amount of overhead and can be disabled to save space if it is not needed.

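That is Kryo's approach: with kryo.setReferences(true), objects are tracked in an array. Only the first appearance of an object is stored in full; each later appearance stores just its position in the array, which keeps the serialized size of cyclic object graphs small. For objects without cyclic references this is unnecessary, because it adds some overhead. I did not quite understand this passage at first, so I asked on Stack Overflow, and @JBNizet's explanation cleared it up.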
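
The idea of storing later appearances as ordinals can be sketched in plain Java. This is a simplified illustration of the technique only, not Kryo's actual implementation or wire format; the class ReferenceTableSketch, the method encode and the "new:"/"ref:" markers are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class ReferenceTableSketch {
    // Walks a sequence of object appearances. The first appearance of an object gets the
    // next ordinal and is emitted as "new:<ordinal>" (a real serializer would write the
    // object's data here); every later appearance is emitted as "ref:<ordinal>" only.
    static List<String> encode(List<Object> appearances) {
        // Identity-based lookup: two equal but distinct objects are different entries.
        Map<Object, Integer> ordinals = new IdentityHashMap<>();
        List<String> out = new ArrayList<>();
        for (Object o : appearances) {
            Integer seen = ordinals.get(o);
            if (seen == null) {
                int ordinal = ordinals.size();
                ordinals.put(o, ordinal);
                out.add("new:" + ordinal);
            } else {
                out.add("ref:" + seen);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Object tom = new Object();
        Object bob = new Object();
        // tom -> bob -> tom: the second appearance of tom is just a back-reference,
        // which is what lets a cyclic graph terminate.
        System.out.println(encode(Arrays.asList(tom, bob, tom))); // [new:0, new:1, ref:0]
    }
}
```

The lookup table is the overhead mentioned above: it has to be maintained even when no object ever appears twice.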

Custom serializers

Kryo ships with many default serializers, but they do not always meet your needs, and sometimes you have to write your own. When doing so, note the following:

The Kryo instance can be used to write and read nested objects. If Kryo is used to read a nested object in read() then kryo.reference() must first be called with the parent object if it is possible for the nested object to reference the parent object.

In other words, when cyclic references are present, kryo.reference(obj) must be called during deserialization. Consider the following example:

People.java

public class People {
    String name;
    People friend;

    public People() {}

    public People(String name) {
        this.name = name;
    }

    public void setFriend(People p) {
        friend = p;
    }

    @Override
    public String toString() {
        return "Friend: " + friend.name;
    }
}

PeopleSerializer.java

public class PeopleSerializer extends Serializer<People>{
    @Override
    public void write(Kryo kryo, Output output, People object) {
        output.writeString(object.name);
        kryo.writeObject(output, object.friend);
    }

    @Override
    public People read(Kryo kryo, Input input, Class<People> type) {
        String name = input.readString();
        People p = new People(name);
        kryo.reference(p); // If this line is commented out, the nested friend's reference back to p is read as null, and Main below throws a NullPointerException
        People friend = kryo.readObject(input, People.class);
        p.setFriend(friend);
        return p;
    }
}

Main.java

public static void main(String[] args) {
        People tom = new People("Tom");
        People bob = new People("Bob");
        tom.setFriend(bob);
        bob.setFriend(tom);

        Kryo kryo = new Kryo();
        kryo.register(People.class, new PeopleSerializer());

        String path = "/Users/lxy/IdeaProjects/firstProject/data/people";
        Output output = null;
        Input input = null;
        try {
            OutputStream outputStream = new FileOutputStream(path);
            output = new Output(outputStream);
            kryo.writeObject(output, tom);

            output.flush();

            InputStream inputStream = new FileInputStream(path);
            input = new Input(inputStream);
            tom = kryo.readObject(input, People.class);
            System.out.println(tom);
            System.out.println(tom.friend);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } finally {
            if (output != null) {
                IOUtils.closeQuietly(output);
            }
            if (input != null) {
                IOUtils.closeQuietly(input);
            }
        }
    }

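Also, when serializing cyclic references, make sure kryo.setReferences(true) is set and that the custom serializer's read() calls kryo.reference(obj).

Effect of kryo.setReferences() on serialized size

This uses the same example as above, except that friend is removed and only name is kept.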

Main.java

kryo.setReferences(false);
OutputStream outputStream = new FileOutputStream(path);
output = new Output(outputStream);
kryo.writeObject(output, tom);

output.flush();

File file = new File(path);
long size = file.length();
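System.out.println(size); // kryo.setReferences(true) size: 4
                          // kryo.setReferences(false) size: 3

As the example shows, when the object graph contains no cyclic references, calling kryo.setReferences(false) is well worth it. Besides the size saving, looking an object up in the reference array is also an extra level of indirection.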

[1] https://github.com/EsotericSoftware/kryo

Scala ClassTag & TypeTag


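I always end up looking up Scala's ClassTag and TypeTag when I need them and forgetting them again afterwards, and I never wrote any of it down. Today I'm writing a short summary, which I will keep updating.

First, the explanation from the official documentation: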

scala.reflect.api.TypeTags#TypeTag. A full type descriptor of a Scala type. For example, a TypeTag[List[String]] contains all type information, in this case, of type scala.List[String].

scala.reflect.ClassTag. A partial type descriptor of a Scala type. For example, a ClassTag[List[String]] contains only the erased class type information, in this case, of type scala.collection.immutable.List. ClassTags provide access only to the runtime class of a type. Analogous to scala.reflect.ClassManifest.

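In other words, ClassTag keeps the type information that remains after erasure, while TypeTag keeps the type information from before erasure, so it is more complete.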
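
For readers coming from Java, the same split exists between java.lang.Class (erased) and java.lang.reflect.Type (full generic signature). The sketch below is only an analogy on the Java side, not how ClassTag or TypeTag are implemented; the class ErasureDemo and its field names are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

final class ErasureDemo {
    // A field declaration keeps its full generic signature in the class file.
    static List<String> names = new ArrayList<>();

    // Looks up the `names` field declared above.
    static Field namesField() {
        try {
            return ErasureDemo.class.getDeclaredField("names");
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Runtime classes are erased: both lists have the class java.util.ArrayList.
        System.out.println(new ArrayList<String>().getClass() == new ArrayList<Integer>().getClass()); // true
        // Erased view of the field, a Class.
        System.out.println(namesField().getType());                      // interface java.util.List
        // Full view of the field, a Type that still knows the type argument.
        System.out.println(namesField().getGenericType().getTypeName()); // java.util.List<java.lang.String>
    }
}
```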

Obtaining a ClassTag & TypeTag (known type)

import scala.reflect._
val ct = classTag[String]

import scala.reflect.runtime.universe._
val tt = typeTag[Int]

Here is the source of classTag and typeTag:

def classTag[T](implicit ctag: ClassTag[T]) = ctag

def typeTag[T](implicit ttag: TypeTag[T]) = ttag

These implicit parameters are generated by the compiler, so the ClassTag and TypeTag can also be obtained as follows:

val ct = implicitly[ClassTag[Int]]
val tt = implicitly[TypeTag[Int]]

implicitly[T] simply captures the implicit value of type T that is in scope and returns it. Its source is:

@inline def implicitly[T](implicit e: T) = e

Obtaining a ClassTag & TypeTag at runtime

For a runtime object, the ClassTag and TypeTag can be obtained as follows:

val list = List[Int](1, 2, 3)
def getClassTag[T: ClassTag](value: T) = {
  //implicitly[ClassTag[T]]
  classTag[T]
}
println(getClassTag(list)) // scala.collection.immutable.List

def getTypeTag[T: TypeTag](value: T) = {
  //implicitly[TypeTag[T]]
  typeTag[T]
}
println(getTypeTag(list)) // TypeTag[List[Int]]

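Through the context bound, the compiler generates the corresponding implicit parameter for us.

Obtaining Class and Type objects

From a ClassTag we can get the corresponding Class object (in Java terms, the Class instance that describes the class) via ct.runtimeClass. The Class obtained is the erased type, equivalent to classOf[T] for a concrete T.

From a TypeTag we can get the corresponding Type object via tt.tpe, equivalent to typeOf[T]. Compared with a Class object, a Type carries more complete information; it is Scala's complement to Java's Class, which only exposes the erased class.

What ClassTag and TypeTag are for

In Java, creating Collection-type objects always requires an explicit type, whereas in Scala a "generic array" can be created through ClassTag. Because a ClassTag retains the runtime class of the element type, an array whose element type is not known at compile time can be created as follows: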

scala> def mkArray[T : ClassTag](elems: T*) = Array[T](elems: _*)
mkArray: [T](elems: T*)(implicit evidence$1: scala.reflect.ClassTag[T])Array[T]

scala> mkArray(42, 13)
res0: Array[Int] = Array(42, 13)

scala> mkArray(List(1), List(2), List(3))
res1: Array[List[Int]] = Array(List(1), List(2), List(3))

scala> mkArray(List(1), List(2), List("1", "3"))
res2: Array[List[Any]] = Array(List(1), List(2), List(1, 3))

The output above also shows that ClassTag keeps only the erased type information.
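
For comparison, the closest Java counterpart passes the element class explicitly, playing the role that the implicit ClassTag plays in mkArray. This is a sketch of the Java analogue under that assumption, not a translation of Scala's implementation; GenericArrayDemo and its mkArray are hypothetical names, and the helper only works for reference types.

```java
import java.lang.reflect.Array;
import java.util.Arrays;

final class GenericArrayDemo {
    // Builds a T[] whose runtime component type is elementClass. The explicit Class<T>
    // argument carries the erased element type, much like an implicit ClassTag[T] does.
    @SafeVarargs
    @SuppressWarnings("unchecked")
    static <T> T[] mkArray(Class<T> elementClass, T... elems) {
        T[] array = (T[]) Array.newInstance(elementClass, elems.length);
        System.arraycopy(elems, 0, array, 0, elems.length);
        return array;
    }

    public static void main(String[] args) {
        String[] words = mkArray(String.class, "a", "b");
        System.out.println(Arrays.toString(words));            // [a, b]
        System.out.println(words.getClass().getSimpleName());  // String[]
    }
}
```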

TypeTag, because it retains more type information, is used more often in reflection.

Reference: http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html